Question 1

Declaration

This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.

Section 1

This report addresses the requests of a panel of politicians and managers at the Food Standards Agency by undertaking the specific analyses they requested. It aims to:

  1. Visualize the distributions, across local authorities, of the percentage of enforcement actions successfully achieved.

  2. Examine whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to enforcement actions, by:

     a. Determining whether there is a relationship between the proportion of successful responses and the number of FTE food safety employees in each local authority.

     b. Determining whether there is a relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in each local authority.
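Before the full analysis below, the two relationships in objective 2 can be sketched in outline. This is a hypothetical illustration only, run here on a tiny synthetic data frame whose column names mirror those used later in this report; it is not the report's actual analysis.

```r
# Hypothetical sketch of the objective-2 analyses, using a small synthetic
# example in place of the real dataset (column names mirror those used below).
set.seed(1)
demo <- data.frame(
  perc.interventions.achieved.by.rated.est = runif(20, 70, 100),
  Number.of.FTE.food.safety.employees      = runif(20, 1, 10),
  Total.establishments                     = sample(500:3000, 20)
)

# 2a. Successful responses vs. number of FTE food safety employees
cor.test(demo$perc.interventions.achieved.by.rated.est,
         demo$Number.of.FTE.food.safety.employees)

# 2b. Successful responses vs. employees as a proportion of establishments
demo$FTE.per.establishment <-
  demo$Number.of.FTE.food.safety.employees / demo$Total.establishments
cor.test(demo$perc.interventions.achieved.by.rated.est,
         demo$FTE.per.establishment)
```

The second ratio normalizes staffing by workload, so that a large authority with many officers is not conflated with a well-staffed one.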

options(width=100)

#Installing necessary packages
install.packages("moments")
install.packages("car")
install.packages('tidyverse')
install.packages('gridExtra')
install.packages('grid')
install.packages("ggpubr")
install.packages("MASS")
install.packages("lmtest")
install.packages('lubridate')
install.packages('emmeans')
install.packages('kableExtra')
install.packages('Rmisc')
install.packages('Hmisc')
#importing libraries
library(car)
library(Rmisc)                    
library(kableExtra)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(emmeans)
library(dplyr)
library(moments)
library(grid)
library(ggpubr)
library(MASS)
library(lmtest)

Data Dictionary

This dataset is provided by the Food Standards Agency. Each row of the dataset contains details for a particular local authority. The variables in the dataset are described below.

Original Variable Name New Variable Name Description
Country Country Country in which the Local Authority is located
LAType LAType Type of the local Authority
LAName LAName Name of the local authority.
Totalestablishments(includingnotyetrated&outside) Total.establishments Total number of establishments within the local authority (including those that were yet to be rated and those which are outside the rating programme)
Establishmentsnotyetratedforintervention Establishments.not.yet.rated Number of establishments that are yet to be rated for intervention but are part of the programme
Establishmentsoutsidetheprogramme Establishments.outside.the.programme Number of establishments that are outside the programme
Total%ofBroadlyCompliantestablishmentsratedA-E perc.of.broadly.comp.rated.est Percentage of Rated Establishments that are broadly compliant with the regulations out of the total number of Rated Establishments in that Local Authority.
Total%ofBroadlyCompliantestablishments(includingnotyetrated) perc.of.broadly.comp.rated.and.unrated.est Percentage of Rated Broadly Compliant Establishments out of total number of establishments that are part of the programme, including those which were not yet rated.
Aratedestablishments A.rated.establishments Number of Establishments that are rated A (greatest potential impact) in that Local Authority.
Total%ofBroadlyCompliantestablishments-A perc.of.broadly.comp.A.rated.est Percentage of establishments that are broadly compliant out of total number of establishments that were rated A in that Local Authority.
Bratedestablishments B.rated.establishments Number of establishments that are rated B in the Area
Total%ofBroadlyCompliantestablishments-B perc.of.broadly.comp.B.rated.est Percentage of establishments that are broadly compliant out of total number of establishments that were rated B in that Local Authority.
Cratedestablishments C.rated.establishments Number of establishments that are rated C in the Area
Total%ofBroadlyCompliantestablishments-C perc.of.broadly.comp.C.rated.est Percentage of establishments that are broadly compliant out of total number of establishments that were rated C in that Local Authority.
Dratedestablishments D.rated.establishments Number of establishments that are rated D in the Area
Total%ofBroadlyCompliantestablishments-D perc.of.broadly.comp.D.rated.est Percentage of D rated establishments that are broadly compliant out of total number of establishments that were rated D in that Local Authority.
Eratedestablishments E.rated.establishments Number of establishments that are rated E (least potential impact) in the Area
Total%ofBroadlyCompliantestablishments-E perc.of.broadly.comp.E.rated.est Percentage of E rated establishments that are broadly compliant out of total number of establishments that were rated E in that Local Authority.
Total%ofInterventionsachieved(premisesratedA-E) perc.interventions.achieved.by.rated.est Total Percentage of Interventions achieved by rated establishments out of the total number of interventions received by rated establishments
Total%ofInterventionsachieved-premisesratedA perc.interventions.achieved.by.A.rated.est Total Percentage of Interventions achieved by A rated establishments out of the total number of interventions received by A-rated establishments
Total%ofInterventionsachieved-premisesratedB perc.interventions.achieved.by.B.rated.est Total Percentage of Interventions achieved by B rated establishments out of the total number of interventions received by B-rated establishments
Total%ofInterventionsachieved-premisesratedC perc.interventions.achieved.by.C.rated.est Total Percentage of Interventions achieved by C rated establishments out of the total number of interventions received by C-rated establishments
Total%ofInterventionsachieved-premisesratedD perc.interventions.achieved.by.D.rated.est Total Percentage of Interventions achieved by D rated establishments out of the total number of interventions received by D-rated establishments.
Total%ofInterventionsachieved-premisesratedE perc.interventions.achieved.by.E.rated.est Total Percentage of Interventions achieved by E rated establishments out of the total number of interventions received by E-rated establishments
Total%ofInterventionsachieved-premisesnotyetrated perc.interventions.achieved.by.unrated.est Total Percentage of Interventions achieved by establishments which were not yet rated out of the total number of interventions received by establishments which were not yet rated.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Voluntaryclosure est.subject.to.enforcement.Voluntary.closure Total Number of Establishments that were subject to Voluntary closure as a formal enforcement action
Totalnumberofestablishmentssubjecttoformalenforcementactions-Seizure,detention&surrenderoffood est.subject.to.enforcement.Seizure.detention.and.surrender.of.food Total Number of Establishments that are subject to Seizures, Detentions or Surrender of food as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Suspension/revocationofapprovalorlicence est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence Total Number of Establishments that are subject to Suspension or revocation of their approval licence as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Hygieneemergencyprohibitionnotice est.subject.to.enforcement.Hygiene.prohibition.notice Total Number of Establishments that are subject to getting Hygiene Emergency Prohibition Notice as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Prohibitionorder est.subject.to.enforcement.Prohibition.order Total Number of Establishments that are subject to getting a Prohibition Order as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Simplecaution est.subject.to.enforcement.Simple.caution Total Number of Establishments that were subject to Simple Caution as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Hygieneimprovementnotices est.subject.to.enforcement.Hygiene.improvement.notices Total Number of Establishments that are subject to Hygiene improvement notices as formal enforcement actions.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Remedialaction&detentionnotices est.subject.to.enforcement.Remedial.action.and.detention.notices Total Number of Establishments that are subject to remedial action and detention notices as formal enforcement actions.
TotalnumberofestablishmentssubjecttoWrittenwarnings est.subject.to.written.warnings Total Number of Establishments that were subject to Written Warnings.
Totalnumberofestablishmentssubjecttoformalenforcementactions-Prosecutionsconcluded est.subject.to.enforcement.Prosecutions.concluded Total Number of Establishments for which Prosecutions were concluded.
ProfessionalFullTimeEquivalentPosts-occupied Number.of.FTE.food.safety.employees Total Number of Professional Full Time Posts that are currently occupied in the Local Authority Office

Data Understanding and Preparation

#Reading the data
dataset <- read_csv("2019-20-enforcement-data-food-hygiene.csv")

#Renaming the columns with the new names
names(dataset) <- c("Country", "LAType", "LAName", "Total.establishments", "Establishments.not.yet.rated", "Establishments.outside.the.programme", "perc.of.broadly.comp.rated.est", "perc.of.broadly.comp.rated.and.unrated.est", "A.rated.establishments", "perc.of.broadly.comp.A.rated.est", "B.rated.establishments", "perc.of.broadly.comp.B.rated.est", "C.rated.establishments", "perc.of.broadly.comp.C.rated.est","D.rated.establishments","perc.of.broadly.comp.D.rated.est","E.rated.establishments","perc.of.broadly.comp.E.rated.est", "perc.interventions.achieved.by.rated.est", "perc.interventions.achieved.by.A.rated.est","perc.interventions.achieved.by.B.rated.est","perc.interventions.achieved.by.C.rated.est","perc.interventions.achieved.by.D.rated.est","perc.interventions.achieved.by.E.rated.est", "perc.interventions.achieved.by.unrated.est", "est.subject.to.enforcement.Voluntary.closure", "est.subject.to.enforcement.Seizure.detention.and.surrender.of.food", "est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence", "est.subject.to.enforcement.Hygiene.prohibition.notice", "est.subject.to.enforcement.Prohibition.order", "est.subject.to.enforcement.Simple.caution", "est.subject.to.enforcement.Hygiene.improvement.notices", "est.subject.to.enforcement.Remedial.action.and.detention.notices", "est.subject.to.written.warnings", "est.subject.to.enforcement.Prosecutions.concluded", "Number.of.FTE.food.safety.employees" )

#Checking its Summary
summary(dataset)
##    Country             LAType             LAName          Total.establishments
##  Length:353         Length:353         Length:353         Min.   : 145.0      
##  Class :character   Class :character   Class :character   1st Qu.: 920.5      
##  Mode  :character   Mode  :character   Mode  :character   Median :1330.0      
##                                                           Mean   :1620.7      
##                                                           3rd Qu.:2004.5      
##                                                           Max.   :9277.0      
##                                                           NA's   :6           
##  Establishments.not.yet.rated Establishments.outside.the.programme perc.of.broadly.comp.rated.est
##  Min.   :   0.00              Min.   :  0.00                       Min.   : 74.61                
##  1st Qu.:  25.00              1st Qu.:  0.00                       1st Qu.: 95.37                
##  Median :  49.00              Median :  2.00                       Median : 97.13                
##  Mean   :  89.75              Mean   : 49.62                       Mean   : 96.33                
##  3rd Qu.: 100.00              3rd Qu.: 39.00                       3rd Qu.: 98.19                
##  Max.   :1744.00              Max.   :865.00                       Max.   :100.00                
##  NA's   :6                    NA's   :6                            NA's   :6                     
##  perc.of.broadly.comp.rated.and.unrated.est A.rated.establishments perc.of.broadly.comp.A.rated.est
##  Min.   :69.45                              Min.   : 0.000         Length:353                      
##  1st Qu.:89.23                              1st Qu.: 1.000         Class :character                
##  Median :92.80                              Median : 2.000         Mode  :character                
##  Mean   :91.54                              Mean   : 4.285                                         
##  3rd Qu.:95.16                              3rd Qu.: 5.000                                         
##  Max.   :99.87                              Max.   :72.000                                         
##  NA's   :6                                  NA's   :6                                              
##  B.rated.establishments perc.of.broadly.comp.B.rated.est C.rated.establishments
##  Min.   :  2.00         Min.   :  5.88                   Min.   :  18.0        
##  1st Qu.: 23.00         1st Qu.: 55.62                   1st Qu.: 144.0        
##  Median : 39.00         Median : 69.23                   Median : 225.0        
##  Mean   : 55.61         Mean   : 67.34                   Mean   : 302.6        
##  3rd Qu.: 68.50         3rd Qu.: 80.15                   3rd Qu.: 376.0        
##  Max.   :516.00         Max.   :100.00                   Max.   :1647.0        
##  NA's   :6              NA's   :6                        NA's   :6             
##  perc.of.broadly.comp.C.rated.est D.rated.establishments perc.of.broadly.comp.D.rated.est
##  Min.   : 71.96                   Min.   :  56.0         Min.   : 75.74                  
##  1st Qu.: 89.36                   1st Qu.: 307.0         1st Qu.: 97.94                  
##  Median : 92.86                   Median : 462.0         Median : 99.03                  
##  Mean   : 92.02                   Mean   : 553.6         Mean   : 98.38                  
##  3rd Qu.: 95.50                   3rd Qu.: 682.0         3rd Qu.: 99.60                  
##  Max.   :100.00                   Max.   :3053.0         Max.   :100.00                  
##  NA's   :6                        NA's   :6              NA's   :6                       
##  E.rated.establishments perc.of.broadly.comp.E.rated.est perc.interventions.achieved.by.rated.est
##  Min.   :  63.0         Min.   : 79.22                   Min.   : 20.64                          
##  1st Qu.: 362.5         1st Qu.: 99.83                   1st Qu.: 81.81                          
##  Median : 480.0         Median :100.00                   Median : 90.82                          
##  Mean   : 565.3         Mean   : 99.82                   Mean   : 86.62                          
##  3rd Qu.: 667.5         3rd Qu.:100.00                   3rd Qu.: 95.39                          
##  Max.   :4309.0         Max.   :100.00                   Max.   :100.00                          
##  NA's   :6              NA's   :6                        NA's   :6                               
##  perc.interventions.achieved.by.A.rated.est perc.interventions.achieved.by.B.rated.est
##  Length:353                                 Min.   : 50.00                            
##  Class :character                           1st Qu.: 93.53                            
##  Mode  :character                           Median : 97.75                            
##                                             Mean   : 95.25                            
##                                             3rd Qu.:100.00                            
##                                             Max.   :100.00                            
##                                             NA's   :6                                 
##  perc.interventions.achieved.by.C.rated.est perc.interventions.achieved.by.D.rated.est
##  Min.   : 18.37                             Min.   : 19.77                            
##  1st Qu.: 89.27                             1st Qu.: 82.00                            
##  Median : 94.97                             Median : 91.72                            
##  Mean   : 91.84                             Mean   : 86.32                            
##  3rd Qu.: 97.77                             3rd Qu.: 95.98                            
##  Max.   :100.00                             Max.   :100.00                            
##  NA's   :6                                  NA's   :6                                 
##  perc.interventions.achieved.by.E.rated.est perc.interventions.achieved.by.unrated.est
##  Min.   :  1.81                             Min.   :  6.56                            
##  1st Qu.: 65.39                             1st Qu.: 85.81                            
##  Median : 87.50                             Median : 99.75                            
##  Mean   : 77.37                             Mean   : 90.95                            
##  3rd Qu.: 95.90                             3rd Qu.:100.00                            
##  Max.   :100.00                             Max.   :100.00                            
##  NA's   :6                                  NA's   :6                                 
##  est.subject.to.enforcement.Voluntary.closure
##  Min.   : 0.000                              
##  1st Qu.: 0.000                              
##  Median : 1.000                              
##  Mean   : 2.712                              
##  3rd Qu.: 3.000                              
##  Max.   :62.000                              
##  NA's   :6                                   
##  est.subject.to.enforcement.Seizure.detention.and.surrender.of.food
##  Min.   : 0.000                                                    
##  1st Qu.: 0.000                                                    
##  Median : 0.000                                                    
##  Mean   : 1.199                                                    
##  3rd Qu.: 1.000                                                    
##  Max.   :52.000                                                    
##  NA's   :6                                                         
##  est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence
##  Min.   :0.00000                                                           
##  1st Qu.:0.00000                                                           
##  Median :0.00000                                                           
##  Mean   :0.06916                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :7.00000                                                           
##  NA's   :6                                                                 
##  est.subject.to.enforcement.Hygiene.prohibition.notice est.subject.to.enforcement.Prohibition.order
##  Min.   : 0.0000                                       Min.   :0.0000                              
##  1st Qu.: 0.0000                                       1st Qu.:0.0000                              
##  Median : 0.0000                                       Median :0.0000                              
##  Mean   : 0.7118                                       Mean   :0.1354                              
##  3rd Qu.: 0.0000                                       3rd Qu.:0.0000                              
##  Max.   :42.0000                                       Max.   :6.0000                              
##  NA's   :6                                             NA's   :6                                   
##  est.subject.to.enforcement.Simple.caution est.subject.to.enforcement.Hygiene.improvement.notices
##  Min.   : 0.0000                           Min.   : 0.000                                        
##  1st Qu.: 0.0000                           1st Qu.: 1.000                                        
##  Median : 0.0000                           Median : 4.000                                        
##  Mean   : 0.4409                           Mean   : 7.527                                        
##  3rd Qu.: 0.0000                           3rd Qu.: 9.000                                        
##  Max.   :14.0000                           Max.   :77.000                                        
##  NA's   :6                                 NA's   :6                                             
##  est.subject.to.enforcement.Remedial.action.and.detention.notices est.subject.to.written.warnings
##  Min.   :0.0000                                                   Min.   :  30.0                 
##  1st Qu.:0.0000                                                   1st Qu.: 195.0                 
##  Median :0.0000                                                   Median : 340.0                 
##  Mean   :0.3314                                                   Mean   : 437.0                 
##  3rd Qu.:0.0000                                                   3rd Qu.: 543.5                 
##  Max.   :9.0000                                                   Max.   :3061.0                 
##  NA's   :6                                                        NA's   :6                      
##  est.subject.to.enforcement.Prosecutions.concluded Number.of.FTE.food.safety.employees
##  Min.   : 0.0000                                   Min.   : 0.65                      
##  1st Qu.: 0.0000                                   1st Qu.: 2.50                      
##  Median : 0.0000                                   Median : 3.41                      
##  Mean   : 0.6657                                   Mean   : 4.10                      
##  3rd Qu.: 1.0000                                   3rd Qu.: 5.00                      
##  Max.   :25.0000                                   Max.   :22.13                      
##  NA's   :6                                         NA's   :6
# Null Value Inspection and Treatment
# Filtering the rows that have NA values. Note: there are 6 NA values in each continuous column, according to the summary above.
filter(dataset, !complete.cases(dataset))
## # A tibble: 6 × 36
##   Country LAType      LAName Total…¹ Estab…² Estab…³ perc.…⁴ perc.…⁵ A.rat…⁶ perc.…⁷ B.rat…⁸ perc.…⁹
##   <chr>   <chr>       <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>
## 1 England District C… Broxb…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 2 England District C… Chorl…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 3 England District C… East …      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 4 England District C… Hyndb…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 5 England District C… Tamwo…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 6 England Metropolit… West …      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## # … with 24 more variables: C.rated.establishments <dbl>, perc.of.broadly.comp.C.rated.est <dbl>,
## #   D.rated.establishments <dbl>, perc.of.broadly.comp.D.rated.est <dbl>,
## #   E.rated.establishments <dbl>, perc.of.broadly.comp.E.rated.est <dbl>,
## #   perc.interventions.achieved.by.rated.est <dbl>,
## #   perc.interventions.achieved.by.A.rated.est <chr>,
## #   perc.interventions.achieved.by.B.rated.est <dbl>,
## #   perc.interventions.achieved.by.C.rated.est <dbl>, …
#Deleting those rows
dataset <- drop_na(dataset)
#Checking for duplicates
length(unique(dataset$LAName)) == nrow(dataset)
## [1] TRUE
#number of records
nrow(dataset)
## [1] 347
#Converting categorical columns to factors
categorical_columns <- c("Country", "LAType")
dataset[categorical_columns] <- lapply(dataset[categorical_columns], as.factor)

#Deleting Redundant variable LAName from dataset
dataset$LAName <- NULL
#Checking for any alphabetic (non-numeric) values in the columns "perc.of.broadly.comp.A.rated.est" and "perc.interventions.achieved.by.A.rated.est"
characters_1 <- filter(dataset, grepl("^[A-Za-z]+$",dataset$perc.of.broadly.comp.A.rated.est))
levels(factor(characters_1$perc.of.broadly.comp.A.rated.est))
## [1] "NP"
characters_2 <- filter(dataset, grepl("^[A-Za-z]+$",dataset$perc.interventions.achieved.by.A.rated.est))
levels(factor(characters_2$perc.interventions.achieved.by.A.rated.est))
## [1] "NR"
#NP: No premises given at this risk rating
#NR: No interventions due or reported.

#Number of rows where the dataset has NP or NR values
nrow(filter(dataset, perc.of.broadly.comp.A.rated.est == "NP" | perc.interventions.achieved.by.A.rated.est == "NR"))
## [1] 57
#Creating a separate tibble, excluding NR values, for visualizing perc.interventions.achieved.by.A.rated.est
visualization.tibble.for.A.rated <- dataset %>% 
  filter(perc.interventions.achieved.by.A.rated.est != "NR")

#Converting this column into numeric
visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est <- as.numeric(visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est)
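An alternative to filtering these rows out would be to recode the placeholder strings as NA before conversion, so that `as.numeric()` needs no prior filter. This is a hypothetical sketch, not the approach used in this report (which drops the NP/NR rows below); `recode.placeholders` and the example vector are illustrative only.

```r
# Hypothetical alternative (not used in this report): recode "NP"/"NR" as NA,
# then convert to numeric. Shown on a small example vector for illustration.
recode.placeholders <- function(x) as.numeric(ifelse(x %in% c("NP", "NR"), NA, x))

example <- c("95.5", "NR", "100", "NP")
recode.placeholders(example)
```

Recoding as NA preserves the affected local authorities for analyses that do not involve the A-rated columns, at the cost of carrying NAs through later calculations.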

Distribution of the Percent of Enforcement Actions that have been Achieved Successfully

#Visualizing the Distribution of the % of enforcement actions successfully achieved by all rated premises combined, across all Local Authorities

annotations <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.rated.est), 2)),y = c(0.08), label = c("Mean "))

ggplot(dataset, aes(perc.interventions.achieved.by.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "lightgrey") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby all premises', caption = "Figure 1. The distribution of enforcement actions successfully achieved across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)

#Visualizing the distribution of the % of enforcement actions successfully achieved by A, B, C, D and E rated premises separately, across all Local Authorities

#Creating Histogram for A rated establishments
annotations_A <- data.frame(x = c(round(mean(visualization.tibble.for.A.rated$perc.interventions.achieved.by.A.rated.est), 2)),y = c(0.68), label = c("Mean  "))

Plot_A <- ggplot(visualization.tibble.for.A.rated, aes(perc.interventions.achieved.by.A.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "red") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby A-rated premises', caption = "Figure 2. The distribution of enforcement actions successfully achieved by A-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.A.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_A, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)


#Creating Histogram for B rated establishments

annotations_B <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.B.rated.est), 2)),y = c(0.4), label = c("Mean "))

Plot_B <- ggplot(dataset, aes(perc.interventions.achieved.by.B.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "orange") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby B-rated premises', caption = "Figure 3. The distribution of enforcement actions successfully achieved by B-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.B.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_B, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)


#Creating Histogram for C rated establishments

annotations_C <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.C.rated.est), 2)),y = c(0.15), label = c("Mean "))

Plot_c <- ggplot(dataset, aes(perc.interventions.achieved.by.C.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#0099F8") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby C-rated premises', caption = "Figure 4. The distribution of enforcement actions successfully achieved by C-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.C.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_C, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)


#Creating Histogram for D rated establishments

annotations_D <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.D.rated.est), 2)),y = c(0.08), label = c("Mean "))

Plot_D <- ggplot(dataset, aes(perc.interventions.achieved.by.D.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#FFFF00") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby D-rated premises', caption = "Figure 5. The distribution of enforcement actions successfully achieved by D-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.D.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_D, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)



#Creating Histogram for E rated establishments
annotations_E <- data.frame(x = c(round(mean(dataset$perc.interventions.achieved.by.E.rated.est), 2)),y = c(0.08), label = c("Mean "))

Plot_E <- ggplot(dataset, aes(perc.interventions.achieved.by.E.rated.est))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "#90EE90") + labs(x= '% of enforcement actions successfully achieved',y='Density', title = 'Distribution of Enforcement Actions successfully achieved\nby E-rated premises', caption = "Figure 6. The distribution of enforcement actions successfully achieved by E-Rated premises across all local authorities") + geom_vline(aes(xintercept = mean(perc.interventions.achieved.by.E.rated.est)), color = "black", size = 1, linetype = "dashed") + geom_text(data = annotations_E, aes(x = x, y = y, label = paste(label, x)), size = 5, fontface = "bold") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6)


#Returning the plots

ggarrange(Plot_A, Plot_B, Plot_c, Plot_D, Plot_E, nrow = 1, ncol = 1)
#Deleting rows in the original dataset where NR and NP are present
dataset <- dataset %>% 
  filter(perc.interventions.achieved.by.A.rated.est != "NR" & perc.of.broadly.comp.A.rated.est != "NP")
#Converting both of the above columns to numeric data type

columns <- c("perc.of.broadly.comp.A.rated.est", "perc.interventions.achieved.by.A.rated.est")

dataset[columns] <- lapply(dataset[columns], as.numeric)
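The conversion step above can be illustrated on a toy data frame (columns invented for illustration, not the report's data): once the non-numeric sentinel rows ("NR"/"NP") are filtered out, `lapply` with `as.numeric` converts the character columns cleanly.

```r
# Toy example of the character-to-numeric conversion (illustrative columns only)
toy <- data.frame(perc.a = c("10.5", "20"), perc.b = c("3", "4.5"),
                  stringsAsFactors = FALSE)
cols <- c("perc.a", "perc.b")
toy[cols] <- lapply(toy[cols], as.numeric)
stopifnot(is.numeric(toy$perc.a), toy$perc.b[2] == 4.5)
```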
#creating calculated columns necessary for further analysis



dataset <- dataset %>% 
  
          #no.of.rated.establishments.given.interventions: Number of rated establishments that are given interventions
  mutate(no.of.rated.establishments.given.interventions = (Total.establishments - Establishments.not.yet.rated - Establishments.outside.the.programme) * (100 - perc.of.broadly.comp.rated.est)/100, 
         
         #no.of.unrated.establishments.given.interventions: Number of unrated establishments that are given interventions
         no.of.unrated.establishments.given.interventions = ((Total.establishments - Establishments.outside.the.programme) * (100-perc.of.broadly.comp.rated.and.unrated.est)/100) - (no.of.rated.establishments.given.interventions), 
         
         
         
         #proportion.of.successful.responses: Number of rated and unrated establishments that successfully achieved the interventions, divided by the total number of establishments that are part of the programme. This ratio is expressed as a percentage.
         proportion.of.successful.responses = (((no.of.rated.establishments.given.interventions*perc.interventions.achieved.by.rated.est/100) + (no.of.unrated.establishments.given.interventions*perc.interventions.achieved.by.unrated.est/100))/(no.of.unrated.establishments.given.interventions + no.of.rated.establishments.given.interventions)) * 100,
         
         
         #total.no.of.establishments.given.interventions: Total Number of establishments that are given interventions
         total.no.of.establishments.given.interventions = no.of.rated.establishments.given.interventions + no.of.unrated.establishments.given.interventions, 
         
         #no.of.enforcement.actions.per.type= Average Number of establishments that were given enforcement actions for every type of enforcement action.
         no.of.enforcement.actions.per.type = (est.subject.to.enforcement.Voluntary.closure + est.subject.to.enforcement.Seizure.detention.and.surrender.of.food + est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence + est.subject.to.enforcement.Hygiene.prohibition.notice + est.subject.to.enforcement.Prohibition.order + est.subject.to.enforcement.Simple.caution + est.subject.to.enforcement.Hygiene.improvement.notices + est.subject.to.enforcement.Remedial.action.and.detention.notices + est.subject.to.written.warnings + est.subject.to.enforcement.Prosecutions.concluded)/10, 
         
         
         #Number.of.employees.per.establishment: Number of Employees allocated per establishment in a particular Local Authority
         Number.of.employees.per.establishment = Number.of.FTE.food.safety.employees/(Total.establishments - Establishments.outside.the.programme))
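The weighted proportion defined in the mutate above can be checked by hand. A minimal sketch with invented figures (not taken from the dataset): 100 rated establishments given interventions with 80% achieved, plus 50 unrated with 60% achieved, gives (80 + 30) / 150 = 73.33%.

```r
# Hypothetical local authority (figures invented for illustration)
rated.given   <- 100  # rated establishments given interventions
unrated.given <- 50   # unrated establishments given interventions
perc.rated    <- 80   # % of interventions achieved by rated establishments
perc.unrated  <- 60   # % of interventions achieved by unrated establishments

# Successes from each group, divided by all establishments given interventions
weighted <- ((rated.given * perc.rated/100) +
             (unrated.given * perc.unrated/100)) /
            (rated.given + unrated.given) * 100
stopifnot(abs(weighted - 73.33333) < 1e-4)
```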

Correlation and Regression Analysis: Does employing more professional enforcement officers increase the likelihood of establishments successfully responding to enforcement actions?

#Checking the distribution of both proportion of successful responses and Number of employees. 

histogram_proportion <- ggplot(dataset, aes(proportion.of.successful.responses))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "yellow") + labs(x= 'proportion of successful responses',y='Density', title = 'Distribution', caption = "Figure 7(a). Histogram for proportion of successful responses") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=60, y=0.05, label= sprintf("Skewness: %s", round(skewness(dataset$proportion.of.successful.responses), 2)))

qqplot_proportion <- ggplot(dataset, aes(sample=proportion.of.successful.responses)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 7(b). Q-Q Plot for proportion of successful responses") + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size=7)
  )


histogram_employees <- ggplot(dataset, aes(Number.of.FTE.food.safety.employees))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees',y='Density', title = 'Distribution', caption = "Figure 8(a). Histogram for number of FTE food safety employees") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=15, y=0.1, label= sprintf("Skewness: %s", round(skewness(dataset$Number.of.FTE.food.safety.employees), 2)))


qqplot_employees<- ggplot(dataset, aes(sample=Number.of.FTE.food.safety.employees)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 8(b). Q-Q Plot for number of FTE food safety employees") + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
  )

grid.arrange(histogram_proportion, qqplot_proportion, histogram_employees, qqplot_employees, ncol = 2, nrow = 2, top=textGrob("Distribution of Proportion of Successful Responses and Number of FTE employees", gp=gpar(fontsize=12, fontface= 2)))

#Kolmogorov-Smirnov test for checking normality
ks.test(dataset$proportion.of.successful.responses, "pnorm", mean = mean(dataset$proportion.of.successful.responses), sd= sd(dataset$proportion.of.successful.responses))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  dataset$proportion.of.successful.responses
## D = 0.17467, p-value = 4.129e-08
## alternative hypothesis: two-sided
ks.test(dataset$Number.of.FTE.food.safety.employees, "pnorm", mean= mean(dataset$Number.of.FTE.food.safety.employees), sd = sd(dataset$Number.of.FTE.food.safety.employees))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  dataset$Number.of.FTE.food.safety.employees
## D = 0.15026, p-value = 4.113e-06
## alternative hypothesis: two-sided
#Since the p-values < 0.05, we reject the null hypothesis and conclude that the distributions these samples came from are not normal. 
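As a sanity check of how this test behaves, here is a quick run on simulated data (illustrative only). One caveat worth noting: plugging the sample mean and sd into `ks.test`, as done above, makes the test somewhat anti-conservative; a Lilliefors-corrected test would be stricter.

```r
# KS normality test on simulated data: one truly normal sample, one skewed
set.seed(42)
ks_norm <- ks.test(rnorm(200), "pnorm", mean = 0, sd = 1)  # normal sample
ks_skew <- ks.test(rexp(200),  "pnorm", mean = 1, sd = 1)  # right-skewed sample
ks_norm$p.value > 0.05  # typically TRUE: no evidence against normality
ks_skew$p.value < 0.05  # typically TRUE: normality rejected
```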

#Running spearman correlation test
cor.test(dataset$proportion.of.successful.responses, dataset$Number.of.FTE.food.safety.employees, method= "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  dataset$proportion.of.successful.responses and dataset$Number.of.FTE.food.safety.employees
## S = 4490871, p-value = 0.0747
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1048238
#Visualizing scatterplot between both of these variables, with a 95% CI best fit line

ggplot(dataset, aes(x= Number.of.FTE.food.safety.employees, y=proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees", caption = "Figure 9. Scatter Plot between Proportion of Successful Responses and Number of FTE employees", color = "legend") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25))

#linear model
test.linear.model <- lm(proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees, data= dataset)
## Linear Regression Diagnostic Plots:

# Residuals vs Fitted Values Plot
residuals_test <- residuals(test.linear.model)
fitted_values_test <- fitted(test.linear.model)
data <- data.frame(residuals_test, fitted_values_test)
residuals.vs.fit.plot_test <- ggplot(data, aes(x = fitted_values_test, y = residuals_test)) + 
  geom_point(alpha = 0.5) + 
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") + 
  geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 10(a). Residual vs Fitted Values Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5))


#QQ plot for residuals
qq.plot.residuals_test <- ggplot(data, aes(sample=residuals_test)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 10(b). Q-Q Plot for errors") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  )

#Scale-Location Plot (square root of |standardized residuals| against fitted values)
sqrt_abs_std_resid_test <- sqrt(abs(rstandard(test.linear.model)))
Scale.location.plot_test <- ggplot(data, aes(x = fitted_values_test, y = sqrt_abs_std_resid_test)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y=expression(sqrt("|Standardized Residuals|")), title = "Scale-Location Plot", caption = "Figure 10(c). Scale-Location Plot") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

#Cook's Distance Plot
cooks_distance_test <- cooks.distance(test.linear.model)
data.cooks_test <- data.frame(observation = 1:length(cooks_distance_test), cooks_distance_test)
Cooks.distance.plot_test <- ggplot(data.cooks_test, aes(x = observation, y = cooks_distance_test, color = cooks_distance_test)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 10(d). Cook's Distance Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )


#Residual vs Leverage Plot
leverage_test <- hatvalues(test.linear.model)
data.rl_test <- data.frame(residuals_test, leverage_test, observation = 1:length(leverage_test))
residual.vs.leverage.plot_test <- ggplot(data.rl_test, aes(x = leverage_test, y = residuals_test, color = residuals_test)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 10(e). Residual vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )



# Cook's Distance vs Leverage Plot
cooks_distance_test <- cooks.distance(test.linear.model)
data.cdl_test <- data.frame(cooks_distance_test, leverage_test, observation = 1:length(leverage_test))
Cooks.dist.vs.leverage.plot_test <- ggplot(data.cdl_test, aes(x = leverage_test, y = cooks_distance_test, color = cooks_distance_test)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 10(f). Cook's Distance vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

ggarrange(residuals.vs.fit.plot_test, qq.plot.residuals_test, Scale.location.plot_test, Cooks.distance.plot_test, residual.vs.leverage.plot_test, Cooks.dist.vs.leverage.plot_test, ncol = 2)
# Statistical tests to check if assumptions are not violated

#Shapiro-Wilk Normality Test
sresid <- studres(test.linear.model) 
shapiro.test(sresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.81682, p-value < 2.2e-16
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(test.linear.model)
## 
##  studentized Breusch-Pagan test
## 
## data:  test.linear.model
## BP = 0.0086749, df = 1, p-value = 0.9258
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(test.linear.model)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.03755358      2.072879   0.526
##  Alternative hypothesis: rho != 0
# The Breusch-Pagan test gives no evidence of heteroskedasticity (p = 0.93), but the Shapiro-Wilk test shows the residuals are far from normal (p < 2.2e-16). Hence inference from this linear regression is not reliable.

#In such cases, transforming the variables towards normality, with subsequent outlier treatment, can mitigate the problem


# Creating a function that transforms the variable using the transformation technique that makes the variable the most normally distributed.
transformation <- function(x, name="", returnvector = TRUE) {
  
#if returnvector = FALSE, the transformed variable will be returned as a tibble having the column name specifying what transformation took place.
#if returnvector = TRUE, the transformed variable will be returned as a vector
#name is the prefix of the column name of the transformed variable that will be returned as a tibble when returnvector = FALSE
  
    if (is.numeric(x) == TRUE){
      if (skewness(x) >= 0) {
        # Create a list of potential transformations
        transformations <- list(log = log(x + 1), sqrt = sqrt(x), reciprocal = 1/(x + 1),
                               square = x^2, cube = x^3, original = x)
        
        # Check the normality of each transformation using the Shapiro-Wilk test
        p_values <- sapply(transformations, function(f) {
          shapiro.test(f)$p.value
        })
        # Identify the transformation with the highest p-value (indicating the most normal distribution)
        best_transformation <- names(which.max(p_values))
        # Return the transformed variable
        
        
        #creating a tibble of the transformed variable
        tibblevector <- tibble(transformations[[best_transformation]])
        
        names(tibblevector) <- c(paste(name,best_transformation))
        if (returnvector == FALSE){
          return (tibblevector)
        }
        else{
          return (pull(tibblevector, 1))
        }
      } else if (skewness(x) < 0) {
        
        #if skewness is less than 0, these transformations will work best when the variable is reflected. 
        
        #creating the reflected variable using the following formula.
        
        y <- max(x) + 1 - x
        # Create a list of potential transformations
        transformations <- list(reflected.log = log(y + 1), reflected.sqrt = sqrt(y), reflected.reciprocal = 1/(y + 1),
                               reflected.square = y^2, reflected.original = y, reflected.cube = y^3, original = x, sqrt = sqrt(x), reciprocal = 1/(x+1), log = log(x + 1))
        # Check the normality of each transformation using the Shapiro-Wilk test
        p_values <- sapply(transformations, function(f) {
          shapiro.test(f)$p.value
        })
        
        # Identify the transformation with the highest p-value (indicating the most normal distribution)
        best_transformation <- names(which.max(p_values))
        
        # Storing the transformed variable (note: reflected transformations reverse the original ordering)
        tibblevector <- tibble(transformations[[best_transformation]])
        names(tibblevector) <- c(paste(name,best_transformation))
        if (returnvector == FALSE){
          return (tibblevector)
        }
        else{
          return (pull(tibblevector, 1))
        }
      }
    } else {
      # if the variable is not numeric, simply return the variable.
        
          return(x)
        }
        
    }
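The selection idea inside `transformation()` can be sketched standalone: try several candidate transformations and keep the one with the largest Shapiro-Wilk p-value. A minimal sketch on a toy right-skewed vector (which candidate wins depends on the data, so none is asserted here):

```r
# Standalone sketch of the best-transformation selection used above
set.seed(1)
x <- rexp(100)                                  # right-skewed toy data
cands <- list(log = log(x + 1), sqrt = sqrt(x),
              reciprocal = 1/(x + 1), original = x)
p_vals <- sapply(cands, function(v) shapiro.test(v)$p.value)
best <- names(which.max(p_vals))                # most normal-looking candidate
best
```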
  
      
    
  


#Creating another function that returns a list of column names along with the transformation applied to each
transformation.info <- function(data=NA){
      list_of_transformations <- list() 
      for (column in names(data)){
        column_vector <- pull(data, column)
        transformed <- transformation(column_vector, returnvector = FALSE)
        if (is.vector(transformed) == FALSE){
          list_of_transformations[[column]] <-  names(transformed)
        }
      }
      return (list_of_transformations)
}
#Applying the transformation function to both of these variables

transformed.proportion.of.successful.responses <- transformation(dataset$proportion.of.successful.responses, returnvector = FALSE)

names(transformed.proportion.of.successful.responses)
## [1] " reflected.log"
transformed.Number.of.FTE.employees <- transformation(dataset$Number.of.FTE.food.safety.employees, returnvector = FALSE)

names(transformed.Number.of.FTE.employees)
## [1] " log"
#proportion.of.successful.responses was transformed using the reflected-log technique
#Number.of.FTE.food.safety.employees was transformed using the log technique.
#Note: the higher the number of FTE employees, the higher its log.
#Note: the higher the proportion of successful responses, the LOWER its reflected log, because reflection reverses the ordering.
#Hence a positive correlation between the transformed variables corresponds to a negative correlation between the variables in their original form (and vice versa).
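A quick numeric check of the ordering behaviour of the reflection formula used in `transformation()` (y = max(x) + 1 - x, then log(y + 1)), on a toy vector:

```r
# Toy check: reflection reverses the ordering of the variable
x <- c(10, 20, 30, 40)
reflected.log <- log(max(x) + 1 - x + 1)  # same formula as in transformation()
stopifnot(which.max(x) == which.min(reflected.log))  # ordering is reversed
stopifnot(cor(x, reflected.log) < 0)                 # correlation flips sign
```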
#Outlier Inspection

#Visualizing Box Plots for both proportion of successful response and Number of FTE employees

visualization.tibble <- tibble(transformed.proportion.of.successful.responses = pull(transformed.proportion.of.successful.responses, 1), transformed.Number.of.FTE.employees = pull(transformed.Number.of.FTE.employees,1))

boxplot1 <- ggplot(visualization.tibble, aes(y = transformed.Number.of.FTE.employees)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "No. of FTE employees", y = "", caption= "Figure 11(a). Box Plot for No of Employees")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))

boxplot2 <- ggplot(visualization.tibble, aes(y = transformed.proportion.of.successful.responses)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "Proportion of successful responses", y = "", caption = "Figure 11(b). Box Plot for proportion of successful responses")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))



grid.arrange(boxplot1, boxplot2, ncol = 2, top=textGrob("Box Plots", gp=gpar(fontsize=12, fontface= 2)))

#Outlier Treatment

#Creating a function to treat outliers in a numerical vector using IQR method
outlier.treatment <- function(x, how= "mean"){
  
  #if how = 'mean', the outliers will be replaced by the mean of the non-outlying values.
  #However if how = 'maxmin', outliers will be capped at the upper and lower fences (Q1 - IQR and Q3 + IQR; note these 1 x IQR fences are tighter than the conventional 1.5 x IQR)
  
  if (is.numeric(x) == TRUE){
    
    # storing P75 Value
    q3 <- quantile(x, 0.75)
    
    #storing P25 value
    q1 <- quantile(x, 0.25)
    
    #Storing IQR Value
    iqr <- IQR(x)
    if (how == "mean"){
      
      #calculating mean after removing outliers
      .mean <- mean(x[x <= q3 + iqr & x >= q1 - iqr])
      
      #replacing upper outliers with the mean
      x[x > q3 + iqr] <- .mean
      
      #replacing lower outliers with the mean
      x[x < q1 - iqr] <- .mean
      
      #returning the treated vector
      return(x)
      
    } else if (how == "maxmin"){
      
      #replacing upper outliers with IQR upper bound
      x[x > q3 + iqr] <- q3 + iqr
      
      #replacing lower outliers with IQR lower bound
      x[x < q1 - iqr] <- q1 - iqr
      return(x)
    }
  }
  else{
    #if vector is not numerical, simply return the vector
    return(x)
  }
    
}
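A usage sketch of `outlier.treatment()` defined above, on a toy vector (recall the fences are 1 x IQR). With `v = c(1, 2, 3, 4, 100)`, Q3 = 4 and IQR = 2, so the upper fence is 6 and the outlier is capped there:

```r
# Toy usage of outlier.treatment() with the 'maxmin' capping strategy
v <- c(1, 2, 3, 4, 100)                    # 100 lies far above the upper fence
treated <- outlier.treatment(v, how = "maxmin")
stopifnot(all(treated == c(1, 2, 3, 4, 6)))  # outlier capped at Q3 + IQR = 6
```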

#Applying the outlier.treatment function to our variables under study
visualization.tibble$transformed.Number.of.FTE.employees <- outlier.treatment(visualization.tibble$transformed.Number.of.FTE.employees)

visualization.tibble$transformed.proportion.of.successful.responses <- outlier.treatment(visualization.tibble$transformed.proportion.of.successful.responses)
#Transforming all variables of the dataset and subsequently treating outliers, creating a new dataset for further analysis. 
transformed.dataset <- tibble(as.data.frame(lapply(dataset, transformation)))
transformed.dataset <- tibble(as.data.frame(lapply(transformed.dataset, outlier.treatment)))

# Returning what kind of transformation was done for each of the variables using UDF
list.of.transformation.info <- transformation.info(data= dataset)

#Storing it as a kable.
table1a <- kable(tibble(Variables = names(list.of.transformation.info), `Transformation applied`= t(as.data.frame(list.of.transformation.info))[,1]), caption = "Table 1(a). List of variables along with their transformations")

vectorb <- c("log" = "Values were replaced by their natural logs", "sqrt" = "Values were replaced by their square roots" , "reciprocal" = "Values were replaced by their reciprocals", "square" = "Values were replaced by their squares", "cube" = "Values were replaced by their cubes", "original" = "All values were kept original", "reflected.log" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, and then the natural log was applied to the resultant values", "reflected.sqrt" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, and then the square root was applied to the resultant values", "reflected.reciprocal" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, and then the reciprocal of the resultant values was taken", "reflected.square" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, and the resultant values were squared", "reflected.cube" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, and the resultant values were cubed", "reflected.original" = "Each value was subtracted from one more than the variable's maximum to mirror the distribution, with no further transformation")

table1b <- kable(tibble(`Transformation applied` = names(vectorb), `Meaning of the transformation` = vectorb), caption = "Table 1(b). Explanation of all transformations")
#Visualizing the scatter plot between these transformed variables

transformed.scatter.plot.between.p.and.n <- ggplot(transformed.dataset, aes(x= Number.of.FTE.food.safety.employees , y= proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees after transformation", caption = "Figure 12. Scatter Plot between Proportion of Successful Responses and Number of FTE employees after transformation") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25))

transformed.scatter.plot.between.p.and.n

qqplot_transformed_proportion <- ggplot(transformed.dataset, aes(sample=proportion.of.successful.responses)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 13(a). Q-Q Plot for proportion of successful responses \n(after transformation)") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  )



  
histogram_transformed_proportion <- ggplot(transformed.dataset, aes(proportion.of.successful.responses))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'proportion of successful responses',y='Density', title = 'Distribution', caption = "Figure 13(b). Histogram for proportion of successful responses \n(after transformation)") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=3.75, y=0.6, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$proportion.of.successful.responses), 2)), size = 3)


histogram_transformed_employees <- ggplot(transformed.dataset, aes(Number.of.FTE.food.safety.employees))+ geom_histogram(aes(y = ..density..), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees',y='Density', title = 'Distribution', caption = "Figure 14(a). Histogram for number of FTE food safety employees \n(after transformation)") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=1.2, y=3, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$Number.of.FTE.food.safety.employees), 2)), size= 3)

qqplot_transformed_employees<- ggplot(transformed.dataset, aes(sample= Number.of.FTE.food.safety.employees)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 14(b). Q-Q Plot for number of FTE food safety employees \n(after transformation)") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  )

grid.arrange(histogram_transformed_proportion, qqplot_transformed_proportion, histogram_transformed_employees, qqplot_transformed_employees, ncol = 2, nrow = 2, top=textGrob("Distribution of Proportion of Successful Responses and Number of FTE employees \nafter transformation", gp=gpar(fontsize=12, fontface= 2)))

#testing for normality using Kolmogorov-Smirnov test
ks.test(transformed.dataset$proportion.of.successful.responses, "pnorm", mean=mean(transformed.dataset$proportion.of.successful.responses), sd= sd(transformed.dataset$proportion.of.successful.responses))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  transformed.dataset$proportion.of.successful.responses
## D = 0.075553, p-value = 0.07297
## alternative hypothesis: two-sided
ks.test(transformed.dataset$Number.of.FTE.food.safety.employees, "pnorm", mean=mean(transformed.dataset$Number.of.FTE.food.safety.employees), sd=sd(transformed.dataset$Number.of.FTE.food.safety.employees))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  transformed.dataset$Number.of.FTE.food.safety.employees
## D = 0.077851, p-value = 0.05948
## alternative hypothesis: two-sided
#Since these are continuous variables and now roughly normally distributed, we run a Pearson correlation
cor.test(y = transformed.dataset$proportion.of.successful.responses, x = transformed.dataset$Number.of.FTE.food.safety.employees, method= "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  transformed.dataset$Number.of.FTE.food.safety.employees and transformed.dataset$proportion.of.successful.responses
## t = 1.0152, df = 288, p-value = 0.3108
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0558476  0.1737010
## sample estimates:
##        cor 
## 0.05971611
#Creating simple linear regression model between both of the variables
simple.linear.model <- lm(proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees, data= transformed.dataset)
summary(simple.linear.model)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees, 
##     data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61006 -0.72858 -0.01466  0.68214  1.92440 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           1.9952     0.2471   8.075 1.85e-14 ***
## Number.of.FTE.food.safety.employees   0.1566     0.1542   1.015    0.311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8472 on 288 degrees of freedom
## Multiple R-squared:  0.003566,   Adjusted R-squared:  0.0001062 
## F-statistic: 1.031 on 1 and 288 DF,  p-value: 0.3108
#Estimation approach: coefficient estimates with their 95% confidence intervals
cbind(coefficient = coef(simple.linear.model), confint(simple.linear.model))
##                                     coefficient      2.5 %    97.5 %
## (Intercept)                           1.9952235  1.5089203 2.4815267
## Number.of.FTE.food.safety.employees   0.1565916 -0.1469945 0.4601778
#Creating a baseline model
baseline.model <- lm(proportion.of.successful.responses ~ 1, data = transformed.dataset)
summary(baseline.model)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ 1, data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54778 -0.70677 -0.01729  0.69574  1.91070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.24092    0.04975   45.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8473 on 289 degrees of freedom
#Comparing model with baseline model
anova(baseline.model, simple.linear.model)
## Analysis of Variance Table
## 
## Model 1: proportion.of.successful.responses ~ 1
## Model 2: proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    289 207.47                           
## 2    288 206.73  1   0.73984 1.0307 0.3108
## Linear Regression Diagnostic Plots:

# Residuals vs Fitted Values Plot
residuals <- residuals(simple.linear.model)
fitted_values <- fitted(simple.linear.model)
data <- data.frame(residuals, fitted_values)
residuals.vs.fit.plot <- ggplot(data, aes(x = fitted_values, y = residuals)) + 
  geom_point(alpha = 0.5) + 
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") + 
  geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 15(a). Residual vs Fitted Values Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5))


#QQ plot for residuals
qq.plot.residuals <- ggplot(data, aes(sample=residuals)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 15(b). Q-Q Plot for errors") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  )

#Scale-Location Plot (square root of absolute standardized residuals vs fitted values)
sqrt_abs_std_resid <- sqrt(abs(rstandard(simple.linear.model)))
Scale.location.plot <- ggplot(data, aes(x = fitted_values, y = sqrt_abs_std_resid)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y="Sqrt of |Standardized Residuals|", title = "Scale-Location Plot", caption = "Figure 15(c). Scale Location Plot") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

#Cook's Distance Plot
cooks_distance <- cooks.distance(simple.linear.model)
data.cooks <- data.frame(observation = 1:length(cooks_distance), cooks_distance)
Cooks.distance.plot <- ggplot(data.cooks, aes(x = observation, y = cooks_distance, color = cooks_distance)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 15(d). Cook's Distance Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )


#Residual vs Leverage Plot
leverage <- hatvalues(simple.linear.model)
data.rl <- data.frame(residuals, leverage, observation = 1:length(leverage))
residual.vs.leverage.plot <- ggplot(data.rl, aes(x = leverage, y = residuals, color = residuals)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 15(e). Residual vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )



# Cook's Distance vs Leverage Plot
cooks_distance <- cooks.distance(simple.linear.model)
data.cdl <- data.frame(cooks_distance, leverage, observation = 1:length(leverage))
Cooks.dist.vs.leverage.plot <- ggplot(data.cdl, aes(x = leverage, y = cooks_distance, color = cooks_distance)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 15(f). Cook's Distance vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

ggarrange(residuals.vs.fit.plot, qq.plot.residuals, Scale.location.plot, Cooks.distance.plot, residual.vs.leverage.plot, Cooks.dist.vs.leverage.plot, ncol = 2)
## (ggarrange output: the six diagnostic plots, Figures 15(a)-(f), rendered across three pages)
# Statistical tests to check whether the regression assumptions are violated

#Shapiro-Wilk Normality Test
sresid <- studres(simple.linear.model) 
shapiro.test(sresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.97443, p-value = 4.757e-05
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(simple.linear.model)
## 
##  studentized Breusch-Pagan test
## 
## data:  simple.linear.model
## BP = 3.8864, df = 1, p-value = 0.04868
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(simple.linear.model)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01787908      2.032779   0.806
##  Alternative hypothesis: rho != 0
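The three checks above (normality, homoskedasticity, and independence of the residuals) can be bundled into one helper. The sketch below is a simplified base-R version of the `shapiro.test` / `lmtest::bptest` / `durbinWatsonTest` calls used in this report, not a replacement for them: the Breusch-Pagan step uses the auxiliary-regression form with the fitted values, and the Durbin-Watson step computes only the statistic, not its p-value.

```r
# Simplified base-R sketch of the three residual checks used above.
# Not a drop-in replacement for shapiro.test()/lmtest::bptest()/car::durbinWatsonTest().
check.assumptions <- function(model) {
  res <- residuals(model)
  # Breusch-Pagan idea: regress the squared residuals on the fitted values;
  # a significant slope suggests heteroskedasticity
  bp.p <- summary(lm(res^2 ~ fitted(model)))$coefficients[2, 4]
  # Durbin-Watson statistic: values near 2 suggest no autocorrelation
  dw <- sum(diff(res)^2) / sum(res^2)
  list(shapiro.p    = shapiro.test(res)$p.value,
       bp.slope.p   = bp.p,
       dw.statistic = dw)
}

# Example on a built-in dataset
check.assumptions(lm(mpg ~ wt, data = mtcars))
```

The helper returns the raw p-values and the DW statistic so the reader can apply whichever significance threshold the report uses.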
#Below is a user-defined function that performs forward selection: starting from the supplied model, it repeatedly adds whichever of the candidate variables best improves the model's fit.

forward.selection <- function(model1, variableunderstudy, vector.of.variables.to.try, interaction = FALSE, vector.of.variables.for.interaction = NA){
  
  # Start from the supplied model
  best.model <- model1
  linear.model.1.fs <- model1
  
  # Perform forward selection: on each pass, add the candidate term giving the
  # largest significant F improvement whose VIF for the main variable stays below 5
  while (TRUE){
    vector.of.terms <- c()
    vector.of.F.values <- c()
    for (var in vector.of.variables.to.try) {
      if (!var %in% names(linear.model.1.fs$model)){
        temp_model <- update(linear.model.1.fs, paste0("~. + ", var))
        if (!any(is.na(temp_model$coefficients))){
          # car::vif() returns a GVIF matrix when the model contains factors
          vif_value <- vif(temp_model)[variableunderstudy]
          if (is.na(vif_value)){
            vif_value <- vif(temp_model)[variableunderstudy, "GVIF"]
          }
          comparison <- anova(linear.model.1.fs, temp_model)
          f_value <- comparison$F[2]
          p_value <- comparison$`Pr(>F)`[2]
          best_model_fvalue <- anova(linear.model.1.fs, best.model)$F[2]
          if (is.na(best_model_fvalue)){
            best_model_fvalue <- 0
          }
          if (vif_value < 5 && p_value < 0.05 && f_value >= 0 && f_value >= best_model_fvalue) {
            vector.of.terms <- append(vector.of.terms, var)
            vector.of.F.values <- append(vector.of.F.values, f_value)
          }
        }
      }
    }
    if (length(vector.of.terms) > 0){
      best.model <- update(linear.model.1.fs, paste0("~. + ", vector.of.terms[which.max(vector.of.F.values)]))
      linear.model.1.fs <- best.model
    }
    else{
      break
    }
  }
  if (interaction == FALSE){
    return (linear.model.1.fs)
  }
  else{
    # Try interaction terms between the main variable and each candidate,
    # keeping the one (if any) that gives the largest significant F improvement
    linear.model.1.fs.withinteraction <- linear.model.1.fs
    vector.of.interaction.terms <- c()
    vector.of.F.values.interaction <- c()
    for (var in vector.of.variables.for.interaction) {
      temp_model <- update(linear.model.1.fs, paste0("~. + ", variableunderstudy, "*", var))
      if (!any(is.na(temp_model$coefficients))){
        comparison <- anova(linear.model.1.fs, temp_model)
        f_value <- comparison$F[2]
        p_value <- comparison$`Pr(>F)`[2]
        best_model_fvalue <- anova(linear.model.1.fs, best.model)$F[2]
        if (is.na(best_model_fvalue)){
          best_model_fvalue <- 0
        }
        if (p_value < 0.05 && f_value > 0 && f_value > best_model_fvalue) {
          vector.of.interaction.terms <- append(vector.of.interaction.terms, var)
          vector.of.F.values.interaction <- append(vector.of.F.values.interaction, f_value)
        }
      }
    }
    if (length(vector.of.interaction.terms) > 0){
      # Add the winning term as an interaction with the main variable
      # (not just as a main effect, which is what was tested above)
      best.model <- update(linear.model.1.fs, paste0("~. + ", variableunderstudy, "*", vector.of.interaction.terms[which.max(vector.of.F.values.interaction)]))
      linear.model.1.fs.withinteraction <- best.model
    }
    return (linear.model.1.fs.withinteraction)
  }
}
#Applying the UDF to build the best multiple linear regression model

mainvariable <- "Number.of.FTE.food.safety.employees"

vars <- colnames(subset(transformed.dataset, select = c(Country, LAType, Total.establishments, Establishments.not.yet.rated, Establishments.outside.the.programme, perc.of.broadly.comp.rated.est, perc.of.broadly.comp.rated.and.unrated.est, A.rated.establishments, perc.of.broadly.comp.A.rated.est, B.rated.establishments, perc.of.broadly.comp.B.rated.est, C.rated.establishments, perc.of.broadly.comp.C.rated.est, D.rated.establishments, perc.of.broadly.comp.D.rated.est, E.rated.establishments, perc.of.broadly.comp.E.rated.est)))




interaction.terms <- c("total.no.of.establishments.given.interventions", "perc.of.broadly.comp.rated.and.unrated.est", "no.of.enforcement.actions.per.type", "Country", "LAType")

#Applying the UDF and creating the model, storing it in multiple.linear.model variable
multiple.linear.model <- forward.selection(model1 = simple.linear.model, variableunderstudy = mainvariable, vector.of.variables.to.try = vars, interaction = TRUE, vector.of.variables.for.interaction = interaction.terms)

#Summary output of the best multiple linear regression model
summary(multiple.linear.model)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.FTE.food.safety.employees + 
##     Establishments.not.yet.rated + E.rated.establishments + perc.of.broadly.comp.rated.and.unrated.est, 
##     data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56580 -0.60396 -0.03199  0.58447  2.01252 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                                -1.88846    0.90025  -2.098  0.03681 * 
## Number.of.FTE.food.safety.employees        -0.33870    0.17303  -1.957  0.05127 . 
## Establishments.not.yet.rated                0.15697    0.08013   1.959  0.05108 . 
## E.rated.establishments                      0.48991    0.16171   3.030  0.00267 **
## perc.of.broadly.comp.rated.and.unrated.est  0.44237    0.14678   3.014  0.00281 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7817 on 285 degrees of freedom
## Multiple R-squared:  0.1606, Adjusted R-squared:  0.1489 
## F-statistic: 13.64 on 4 and 285 DF,  p-value: 3.479e-10
# The p-value is not less than 0.05, indicating that the relationship between the number of FTE employees and the proportion of successful responses is not significant.

Correlation and Regression Analysis: Does an increase in the number of employees per establishment increase the likelihood of establishments successfully responding to enforcement actions?

# Visualizing the distribution of variables

histogram_employees_per_establishment <- ggplot(dataset, aes(Number.of.employees.per.establishment))+ geom_histogram(aes(y = after_stat(density)), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees per establishment',y='Density', title = 'Distribution', caption = "Figure 16(a). Histogram for number of FTE food safety employees per establishment") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=0.0075, y=200, label= sprintf("Skewness: %s", round(skewness(dataset$Number.of.employees.per.establishment), 2)))


qqplot_employees_per_establishment<- ggplot(dataset, aes(sample=Number.of.employees.per.establishment)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 16(b). Q-Q Plot for number of FTE food safety employees per establishment") + theme(
    plot.title = element_text(color = "black", size = 10, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 7)
  )

grid.arrange(histogram_employees_per_establishment, qqplot_employees_per_establishment, ncol = 2, top=textGrob("Distribution of Number of FTE employees \nper establishment", gp=gpar(fontsize=12, fontface= 2)))

#Kolmogorov-Smirnov test for checking normality

ks.test(dataset$Number.of.employees.per.establishment, "pnorm", mean= mean(dataset$Number.of.employees.per.establishment), sd = sd(dataset$Number.of.employees.per.establishment))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  dataset$Number.of.employees.per.establishment
## D = 0.10284, p-value = 0.004336
## alternative hypothesis: two-sided
#Since the p value < 0.05, we reject the null hypothesis and conclude that the distribution this sample came from is not normal.

#Running spearman correlation test
cor.test(dataset$proportion.of.successful.responses, dataset$Number.of.employees.per.establishment, method= "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  dataset$proportion.of.successful.responses and dataset$Number.of.employees.per.establishment
## S = 3131228, p-value = 7.909e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2296695
#Checking the type of best transformation for this variable
list.of.transformation.info$Number.of.employees.per.establishment
## [1] " sqrt"
#Since the transformation for this variable is square root and the transformation for proportion of successful responses is reflected log, a positive correlation between the transformed variables will indicate a negative correlation in their original form, and vice versa.
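A toy numeric illustration of this sign reversal (simulated data, not the report's dataset): reflecting the response reverses its order, so a negative relationship on the raw scale shows up as a positive correlation between the transformed variables.

```r
# Toy data: y decreases as x increases (negative relationship on the raw scale)
set.seed(1)
x <- runif(100, 1, 10)
y <- 100 - 5 * x + rnorm(100)

# Apply the same style of transformations: reflected log to the response,
# square root to the predictor
y.reflected.log <- log(max(y) + 1 - y)
x.sqrt <- sqrt(x)

cor(x, y)                     # negative on the original scale
cor(x.sqrt, y.reflected.log)  # positive after transformation
```

The reflected-log constant here (`max(y) + 1`) is the usual textbook choice; the exact constant used in the report's transformation step may differ.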
#Visualizing scatterplot between both of these variables, with a 95% CI best fit line

ggplot(dataset, aes(x= Number.of.employees.per.establishment, y=proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees per establishment", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees per est.", caption = "Figure 17. Scatter Plot between Proportion of Successful Responses and Number of FTE employees per est.") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25))

# Errors are heteroskedastic, and the distribution of the errors does not seem normal. Hence, linear regression results will not be reliable.

#In such cases, transforming the variables towards normal distributions, with subsequent outlier treatment, solves the problem. We have already transformed the variables and removed outliers above.


#Visualizing the scatter plot between these transformed variables

ggplot(transformed.dataset, aes(x= Number.of.employees.per.establishment , y= proportion.of.successful.responses)) + geom_point() + geom_smooth(method = lm, fill= "grey") + labs(y = "Proportion of Successful Responses", x = "Number of FTE employees per establishment", title = "Scatter Plot between Proportion of Successful Responses \nand Number of FTE employees per establishment after transformation", caption = "Figure 18. Scatter Plot between Proportion of Successful Responses and Number of FTE employees after transformation") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25))

#Visualizing Number of employees per establishment after transformation
histogram_transformed_employees_per_est <- ggplot(transformed.dataset, aes(Number.of.employees.per.establishment))+ geom_histogram(aes(y = after_stat(density)), bins = 50, colour = "Black", fill = "blue") + labs(x= 'number of FTE food safety employees per establishment',y='Density', title = 'Distribution', caption = "Figure 19(a). Histogram for number of FTE food safety employees \n per establishment (after transformation)") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  ) + geom_density(color = "#000000", fill = "grey", alpha = 0.6) + annotate("text", x=0.06, y=60, label= sprintf("Skewness: %s", round(skewness(transformed.dataset$Number.of.employees.per.establishment), 2)), size= 3)

qqplot_transformed_employees_per_est<- ggplot(transformed.dataset, aes(sample= Number.of.employees.per.establishment)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 19(b). Q-Q Plot for number of FTE food safety employees \n per establishment (after transformation)") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25, size = 6)
  )

grid.arrange(histogram_transformed_employees_per_est, qqplot_transformed_employees_per_est, ncol = 2, top=textGrob("Distribution of Number of FTE employees per establishment \nafter transformation", gp=gpar(fontsize=12, fontface= 2)))

#testing normality
ks.test(transformed.dataset$Number.of.employees.per.establishment, "pnorm", mean= mean(transformed.dataset$Number.of.employees.per.establishment), sd = sd(transformed.dataset$Number.of.employees.per.establishment))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  transformed.dataset$Number.of.employees.per.establishment
## D = 0.057347, p-value = 0.296
## alternative hypothesis: two-sided
#Since the KS test p-value > 0.05 and these are continuous, roughly normally distributed variables, we run a Pearson correlation test
cor.test(y = transformed.dataset$proportion.of.successful.responses, x = transformed.dataset$Number.of.employees.per.establishment, method= "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  transformed.dataset$Number.of.employees.per.establishment and transformed.dataset$proportion.of.successful.responses
## t = -3.5192, df = 288, p-value = 0.0005029
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.31095707 -0.08997457
## sample estimates:
##        cor 
## -0.2030499
#Creating simple linear regression model between both of the variables
simple.linear.model.1 <- lm(proportion.of.successful.responses ~ Number.of.employees.per.establishment, data= transformed.dataset)
summary(simple.linear.model.1)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.employees.per.establishment, 
##     data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75608 -0.59873 -0.08871  0.61966  2.22016 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             3.4809     0.3557   9.786  < 2e-16 ***
## Number.of.employees.per.establishment -24.1393     6.8594  -3.519 0.000503 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8311 on 288 degrees of freedom
## Multiple R-squared:  0.04123,    Adjusted R-squared:  0.0379 
## F-statistic: 12.38 on 1 and 288 DF,  p-value: 0.0005029
#estimation approach
cbind(coefficient = coef(simple.linear.model.1), confint(simple.linear.model.1))
##                                       coefficient      2.5 %     97.5 %
## (Intercept)                              3.480877   2.780766   4.180988
## Number.of.employees.per.establishment  -24.139299 -37.640114 -10.638483
#Creating a baseline model
baseline.model <- lm(proportion.of.successful.responses ~ 1, data = transformed.dataset)
summary(baseline.model)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ 1, data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54778 -0.70677 -0.01729  0.69574  1.91070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.24092    0.04975   45.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8473 on 289 degrees of freedom
#Comparing the model with the baseline model
anova(baseline.model, simple.linear.model.1)
## Analysis of Variance Table
## 
## Model 1: proportion.of.successful.responses ~ 1
## Model 2: proportion.of.successful.responses ~ Number.of.employees.per.establishment
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    289 207.47                                  
## 2    288 198.91  1    8.5538 12.385 0.0005029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Linear Regression Diagnostic Plots:

# Residuals vs Fitted Values Plot
residuals <- residuals(simple.linear.model.1)
fitted_values <- fitted(simple.linear.model.1)
data <- data.frame(residuals, fitted_values)
residuals.vs.fit.plot <- ggplot(data, aes(x = fitted_values, y = residuals)) + 
  geom_point(alpha = 0.5) + 
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") + 
  geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(y = "Residuals", x = "Fitted Values", title = "Residuals vs Fitted Values Plot", caption = "Figure 20(a). Residual vs Fitted Values Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5))


#QQ plot for residuals
qq.plot.residuals <- ggplot(data, aes(sample=residuals)) +
  stat_qq() + 
  stat_qq_line() + theme_classic() + labs(x= 'Theoretical Quantiles',y='Sample Quantiles', title = 'Q-Q Plot', caption = "Figure 20(b). Q-Q Plot for errors") + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  )

#Scale-Location Plot (square root of absolute standardized residuals vs fitted values)
sqrt_abs_std_resid <- sqrt(abs(rstandard(simple.linear.model.1)))
Scale.location.plot <- ggplot(data, aes(x = fitted_values, y = sqrt_abs_std_resid)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", color = "blue", se = FALSE) + labs(x= "Fitted Values",y="Sqrt of |Standardized Residuals|", title = "Scale-Location Plot", caption = "Figure 20(c). Scale Location Plot") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

#Cook's Distance Plot
cooks_distance <- cooks.distance(simple.linear.model.1)
data.cooks <- data.frame(observation = 1:length(cooks_distance), cooks_distance)
Cooks.distance.plot <- ggplot(data.cooks, aes(x = observation, y = cooks_distance, color = cooks_distance)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Observation",y="Cook's Distance", title = "Cook's Distance Plot", caption = "Figure 20(d). Cook's Distance Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )


#Residual vs Leverage Plot
leverage <- hatvalues(simple.linear.model.1)
data.rl <- data.frame(residuals, leverage, observation = 1:length(leverage))
residual.vs.leverage.plot <- ggplot(data.rl, aes(x = leverage, y = residuals, color = residuals)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") + labs(x= "Leverage",y="Residuals", title = "Residual vs Leverage Plot", caption = "Figure 20(e). Residual vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )



# Cook's Distance vs Leverage Plot
cooks_distance <- cooks.distance(simple.linear.model.1)
data.cdl <- data.frame(cooks_distance, leverage, observation = 1:length(leverage))
Cooks.dist.vs.leverage.plot <- ggplot(data.cdl, aes(x = leverage, y = cooks_distance, color = cooks_distance)) +
  geom_point() + 
  scale_color_gradient(low = "blue", high = "red") + labs(x= "Leverage",y="Cook's Distance", title = "Cook's Distance vs Leverage Plot", caption = "Figure 20(f). Cook's Distance vs Leverage Plot") +
  theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.5)
  )

ggarrange(residuals.vs.fit.plot, qq.plot.residuals, Scale.location.plot, Cooks.distance.plot, residual.vs.leverage.plot, Cooks.dist.vs.leverage.plot, ncol = 2)
## (ggarrange output: the six diagnostic plots, Figures 20(a)-(f), rendered across three pages)
# Statistical tests to check whether the regression assumptions are violated

#Shapiro-Wilk Normality Test
sresid <- studres(simple.linear.model.1) 
shapiro.test(sresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.98466, p-value = 0.003423
#studentized Breusch-Pagan test to check homoskedasticity of residuals
lmtest::bptest(simple.linear.model.1)
## 
##  studentized Breusch-Pagan test
## 
## data:  simple.linear.model.1
## BP = 1.3144, df = 1, p-value = 0.2516
#Durbin-Watson test to check if residuals are autocorrelated
durbinWatsonTest(simple.linear.model.1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.00650269      1.984469    0.92
##  Alternative hypothesis: rho != 0
#Applying the UDF to build the best multiple linear regression model

mainvariable <- "Number.of.employees.per.establishment"

vars <- colnames(subset(transformed.dataset, select = c(Country, LAType, Total.establishments, Establishments.not.yet.rated, Establishments.outside.the.programme, perc.of.broadly.comp.rated.est, perc.of.broadly.comp.rated.and.unrated.est, A.rated.establishments, perc.of.broadly.comp.A.rated.est, B.rated.establishments, perc.of.broadly.comp.B.rated.est, C.rated.establishments, perc.of.broadly.comp.C.rated.est, D.rated.establishments, perc.of.broadly.comp.D.rated.est, E.rated.establishments, perc.of.broadly.comp.E.rated.est)))

interaction.terms <- c("total.no.of.establishments.given.interventions", "perc.of.broadly.comp.rated.and.unrated.est", "Country", "LAType")

multiple.linear.model.1 <- forward.selection(model1 = simple.linear.model.1, variableunderstudy = mainvariable, vector.of.variables.to.try = vars, interaction = TRUE, vector.of.variables.for.interaction = interaction.terms)


summary(multiple.linear.model.1)
## 
## Call:
## lm(formula = proportion.of.successful.responses ~ Number.of.employees.per.establishment + 
##     Establishments.not.yet.rated + perc.of.broadly.comp.rated.and.unrated.est + 
##     E.rated.establishments, data = transformed.dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54778 -0.61265 -0.05003  0.58788  2.03381 
## 
## Coefficients:
##                                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                                 -0.41779    1.04895  -0.398   0.6907   
## Number.of.employees.per.establishment      -11.15271    6.91707  -1.612   0.1080   
## Establishments.not.yet.rated                 0.12848    0.08084   1.589   0.1131   
## perc.of.broadly.comp.rated.and.unrated.est   0.42800    0.14709   2.910   0.0039 **
## E.rated.establishments                       0.28490    0.14392   1.980   0.0487 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7834 on 285 degrees of freedom
## Multiple R-squared:  0.157,  Adjusted R-squared:  0.1452 
## F-statistic: 13.27 on 4 and 285 DF,  p-value: 6.259e-10

Section 2

This report presents the results of the following analyses requested by the board:

  1. Visualize the distributions across local authorities of the % of enforcement actions successfully achieved.

  2. Examine whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to enforcement actions by:

  1. Determining whether there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority.

  2. Determining whether there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority

Distribution across Local Authorities of the % of enforcement actions successfully achieved.

In Figure 1, we can see that for most local authorities the success rate is between 80% and 100%, with an average of 86.62%.

To investigate further, Figures 2-6 below show success rate by rating, with A-rated establishments having the highest average success rate of 98%. The visualizations illustrate that the more harmful an establishment's potential impact on public health, the higher the average success rate of achieving enforcement actions for those establishments. However, establishing that the rating of the establishments causes this shift in average success rate would require further analysis, which is outside the scope of this report.

However, it is important to consider how the proportion of successful responses (the success rate) in each local authority can be improved, and whether it can be improved by implementing the necessary actions.

The proportion of successful responses, including all establishments that are part of the programme, was calculated using the following formula:

\[{\text{proportion of successful responses for all establishments}} = \frac{\text{Number of interventions achieved by rated establishments} + \text{Number of interventions achieved by unrated establishments}}{\text{Total number of interventions given to rated establishments} + \text{Total number of interventions given to unrated establishments}}\]

where

\[{\text{Number of interventions achieved by rated establishments}} = \text{Number of rated establishments that were given an intervention} \times \frac{\text{Percentage of interventions achieved by rated establishments}}{100},\]

\[{\text{Number of interventions achieved by unrated establishments}} = \text{Number of unrated establishments that were given an intervention} \times \frac{\text{Percentage of interventions achieved by unrated establishments}}{100},\]

\[{\text{Number of rated establishments that were given an intervention}} = \text{Total number of rated establishments in the LA} \times \frac{100 - \text{\% of rated establishments which are broadly compliant}}{100},\]

\[{\text{Number of unrated establishments that were given an intervention}} = \left(\text{Total number of establishments that are part of the programme} \times \frac{100 - \text{\% of establishments (incl. rated and unrated) which are broadly compliant}}{100}\right) - \text{Number of rated establishments that were given an intervention.}\]
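The derivation above can be sketched in R. The column names and toy values below are hypothetical placeholders for illustration only, not the dataset's actual variables:

```r
# Toy one-row data frame (hypothetical names and values, for illustration)
la.data <- data.frame(
  total.rated.establishments = 100,
  total.establishments.in.programme = 150,
  perc.broadly.compliant.rated = 80,  # % of rated ests broadly compliant
  perc.broadly.compliant.all = 80,    # % of all ests broadly compliant
  perc.achieved.rated = 90,           # % of interventions achieved, rated
  perc.achieved.unrated = 60          # % of interventions achieved, unrated
)

# Establishments given an intervention (the non-broadly-compliant ones)
rated.given <- with(la.data,
  total.rated.establishments * (100 - perc.broadly.compliant.rated) / 100)
unrated.given <- with(la.data,
  total.establishments.in.programme * (100 - perc.broadly.compliant.all) / 100
) - rated.given

# Interventions achieved, then the overall proportion of successful responses
achieved.rated   <- rated.given   * la.data$perc.achieved.rated   / 100
achieved.unrated <- unrated.given * la.data$perc.achieved.unrated / 100
prop.successful  <- (achieved.rated + achieved.unrated) /
                    (rated.given + unrated.given)
prop.successful  # 0.8 for the toy values above
```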

2) Does employing more professional enforcement officers increase the likelihood of establishments successfully responding to enforcement actions?

a) Examining the relationship between the proportion of successful responses and the number of employees

Before conducting further analysis, variables were appropriately transformed. Table 1 lists the transformation applied to each variable. Note that the proportion of successful responses was transformed using the “reflected.log” technique (refer to Table 1(b)), so an increase in its transformed value indicates a decrease in its original value. The number of employees was transformed using “log”, i.e., the original values were replaced by their natural logs, so an increase in the transformed value indicates an increase in the original value. Hence, a positive relationship between these transformed variables indicates a negative relationship between their original values, and vice versa.

Table 1. List of variables along with their transformations
Variables Transformation applied
Total.establishments log
Establishments.not.yet.rated log
Establishments.outside.the.programme log
perc.of.broadly.comp.rated.est reflected.log
perc.of.broadly.comp.rated.and.unrated.est reflected.log
A.rated.establishments reciprocal
perc.of.broadly.comp.A.rated.est sqrt
B.rated.establishments log
perc.of.broadly.comp.B.rated.est reflected.sqrt
C.rated.establishments log
perc.of.broadly.comp.C.rated.est reflected.sqrt
D.rated.establishments log
perc.of.broadly.comp.D.rated.est reflected.reciprocal
E.rated.establishments log
perc.of.broadly.comp.E.rated.est reflected.reciprocal
perc.interventions.achieved.by.rated.est reflected.log
perc.interventions.achieved.by.A.rated.est reflected.reciprocal
perc.interventions.achieved.by.B.rated.est reflected.log
perc.interventions.achieved.by.C.rated.est reflected.log
perc.interventions.achieved.by.D.rated.est reflected.log
perc.interventions.achieved.by.E.rated.est reflected.log
perc.interventions.achieved.by.unrated.est reflected.log
est.subject.to.enforcement.Voluntary.closure log
est.subject.to.enforcement.Seizure.detention.and.surrender.of.food reciprocal
est.subject.to.enforcement.Suspension.or.revocation.of.approval.or.licence reciprocal
est.subject.to.enforcement.Hygiene.prohibition.notice reciprocal
est.subject.to.enforcement.Prohibition.order reciprocal
est.subject.to.enforcement.Simple.caution reciprocal
est.subject.to.enforcement.Hygiene.improvement.notices log
est.subject.to.enforcement.Remedial.action.and.detention.notices reciprocal
est.subject.to.written.warnings log
est.subject.to.enforcement.Prosecutions.concluded reciprocal
Number.of.FTE.food.safety.employees log
no.of.rated.establishments.given.interventions log
no.of.unrated.establishments.given.interventions log
proportion.of.successful.responses reflected.log
total.no.of.establishments.given.interventions log
no.of.enforcement.actions.per.type log
Number.of.employees.per.establishment sqrt
Table 1(b). Explanation of all transformations
Transformation applied Meaning of the transformation
log Values were replaced by their Natural Log
sqrt Values were replaced by their square root values
reciprocal Values were replaced by their reciprocal values
square Values were replaced by their squared values
cube Values were replaced by their cube values
original All Values were kept original
reflected.log Each value was first subtracted from the maximum value of the variable to get the mirror image of the distribution, and then the natural log was applied to the resultant values
reflected.sqrt Each value was first subtracted from the maximum value of the variable to get the mirror image of the distribution, and then the square root was applied to the resultant values
reflected.reciprocal Each value was first subtracted from the maximum value of the variable to get the mirror image of the distribution, and then the reciprocal of the resultant values was taken
reflected.square Each value was first subtracted from the maximum value of the variable to get the mirror image of the distribution, and the resultant values were squared
reflected.cube Each value was first subtracted from the maximum value of the variable to get the mirror image of the distribution, and the resultant values were cubed
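The reflected transformations in Table 1(b) can be written as small helper functions. This is a sketch of the definitions above, not the report's actual code; adding 1 before the log and reciprocal is an assumption, since reflection maps the maximum to 0, where log() and 1/x are undefined.

```r
# Reflection: subtract each value from the variable's maximum, mirroring
# the distribution so that left skew becomes right skew.
reflect <- function(x) max(x, na.rm = TRUE) - x

# The +1 offsets are an assumption to avoid log(0) and division by zero
# at the reflected maximum; the report does not state how this was handled.
reflected.log        <- function(x) log(reflect(x) + 1)
reflected.sqrt       <- function(x) sqrt(reflect(x))
reflected.reciprocal <- function(x) 1 / (reflect(x) + 1)

reflected.log(c(60, 90, 100))  # the largest original value maps to 0
```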

A Pearson correlation test was conducted, which showed a low correlation coefficient of 0.05, indicating a weak positive relationship between these variables. Moreover, this relationship is not statistically significant (p > 0.05). Hence, we fail to reject the null hypothesis that the true correlation between the two transformed variables is 0.
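A test of this kind corresponds to R's cor.test(); the sketch below uses simulated stand-ins for the transformed variables, since the FSA data are not reproduced here:

```r
set.seed(1)
# Simulated stand-ins (an assumption) for the 290 transformed observations
log.employees <- log(runif(290, min = 1, max = 50))
reflog.prop.successful <- rnorm(290)

# Pearson test of H0: true correlation = 0
ct <- cor.test(log.employees, reflog.prop.successful, method = "pearson")
ct$estimate  # sample correlation coefficient r
ct$p.value   # if p > 0.05, we fail to reject H0
```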

For further investigation, a simple linear regression model between the two transformed variables was constructed. The results showed an increase of 0.51 units in the transformed value of the proportion of successful responses per unit increase in the transformed value of the number of employees. However, this increase is not significantly different from 0, t(288) = 1.015, p > 0.05. Hence we fail to reject the null hypothesis that, without taking control variables into consideration, there is no relationship between the two variables.

The scatter plot (Figure 11) presents the regression result using the estimation approach. The best-fit line indicates a positive relationship between the proportion of successful responses and the number of FTE employees after the variables were transformed and treated for outliers. However, the 95% CI band for the best-fit line includes a line with slope 0, so it is plausible that there is no true linear relationship between the two variables in the population. Therefore, the null hypothesis that there is no relationship between the two variables cannot be rejected.

To further evaluate the relationship between the number of FTE employees and the proportion of successful responses, a forward feature-selection algorithm was created to build the best multiple linear regression model for the research question. This model helps us understand the relationship between the number of FTE employees and the proportion of successful responses when controlling for other significant variables, and also reveals any interaction effects between the number of FTE employees and the candidate variables selected for interaction terms.

Variables such as the total number of establishments given interventions, the percentage of broadly compliant establishments, the number of enforcement actions, Country and LAType were selected as potential candidates with which the number of FTE employees may show interaction effects. These variables were selected because they may affect the relationship through factors such as the resources and capacity to implement interventions, the level of compliance and non-compliance, and the resources available to the local authority.

Moreover, the candidate control variables were limited to those that can be considered leading indicators of the proportion of successful responses.
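The custom algorithm itself is not reproduced in this report; a minimal sketch of forward selection with R's built-in step(), on simulated data, approximates the approach described (it is not the actual implementation):

```r
set.seed(2)
# Simulated stand-in data (an assumption); y plays the role of the
# transformed proportion of successful responses.
d <- data.frame(y = rnorm(100), emp = rnorm(100),
                comp = rnorm(100), actions = rnorm(100))

# Start from the intercept-only model and add terms (including an
# employee interaction candidate) while AIC improves.
null.model <- lm(y ~ 1, data = d)
best.model <- step(null.model,
                   scope = ~ emp + comp + actions + emp:comp,
                   direction = "forward", trace = 0)
summary(best.model)
```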

The best multiple linear regression model contained no interaction term and showed a decrease of 0.33 units in the transformed value of the proportion of successful responses per unit increase in the transformed value of the number of employees, when controlling for other variables. This decrease is not significantly different from 0, t(285) = -1.957, p > 0.05, leading to the conclusion that there is no evidence of a relationship between the two variables.

However, there might exist a relationship between the number of employees allocated per establishment and the proportion of successful responses. The number of employees allocated per establishment was calculated using the following formula:

\[{\text{Number of Employees per establishment}} = \frac{\text{Number of employees}}{\text{Total Number of Establishments}}\]

b) Examining the relationship between the proportion of successful responses and the number of employees per establishment

As per Table 1, the number of employees per establishment was transformed using the “sqrt” method, so an increase in its transformed value indicates an increase in its original value. However, since the proportion of successful responses was transformed with “reflected.log”, it is important to note for further analysis that a positive relationship between the transformed values of the two variables indicates a negative relationship between their original values.

A Pearson correlation test between the two transformed variables gave a correlation coefficient of -0.2, which was significantly different from 0 (p < 0.05), thus rejecting the null hypothesis that the true correlation is 0.

A simple linear regression model between the two variables found a decrease of 24.13 units in the transformed value of the proportion of successful responses per unit increase in the transformed value of the number of employees per establishment. This decrease is significantly different from 0, t(288) = 9.786, p < 0.001.

Additionally, the scatter plot (Figure 18) presents the regression result using the estimation approach. The best-fit line indicates a negative relationship between the proportion of successful responses and the number of FTE employees per establishment after transformation and outlier treatment. Moreover, the 95% CI band for the best-fit line does not include a line with slope 0, indicating that a true linear relationship between the two variables is likely in the population.

To further evaluate the relationship, the same forward-feature-selection algorithm was used with the same candidates for control variables and interaction terms.

In this case, the best multiple regression model revealed no significant relationship between the transformed values of the number of employees per establishment and the proportion of successful responses after controlling for other variables, \(t(285) = -1.612, p > 0.05\). The control variables captured the effect that the number of employees has on the proportion of successful responses, making its individual effect redundant. Hence, increasing the number of employees will not necessarily increase the likelihood of successful responses to enforcement actions.

Hence, we do not have sufficient evidence to suggest that increasing the number of employees, or the number of employees per establishment, will increase the likelihood of establishments successfully responding to enforcement actions.

Question 2

This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.

Section 1

This report is designed to meet the requests of managers of a publishing company, undertaking the specific analyses requested. This report aims to:

  1. Examine whether books from different genres have different daily sales.

  2. Examine whether the sales of the books depend on their average review scores and total number of reviews.

  3. Examine the effect of sales price of the book on the number of sales for the book and if it differs for books belonging to different genres.

Data Dictionary

This dataset is provided by a publishing company. The variables in the dataset are described below.

Variable Description
sold by Name of the publisher selling the books
publisher.type Type of publisher selling the book
genre Genre or Category of the book
avg.review Average review of the book by the readers (out of 5)
daily.sales Average number of sales (minus refunds) across all days in the period
total.reviews Total number of reviews given for the book by the readers
sale.price Average price for which the book is sold across all sales in the period

Loading and Reading the provided Dataset

#Loading the required original dataset for the project and creating a dataframe.
publishing.data <-  read_csv("publisher_sales.csv")

#Rename the columns in the dataset that have special characters or other insignificant characters to improve the readability of the dataset.
names(publishing.data)[names(publishing.data) == 'sold by'] <- 'sold.by'

Data Quality Checks, Visualizations and Cleaning the Data

#Checking the summary of the data
summary(publishing.data)
##    sold.by          publisher.type        genre             avg.review     daily.sales    
##  Length:6000        Length:6000        Length:6000        Min.   :0.000   Min.   : -0.53  
##  Class :character   Class :character   Class :character   1st Qu.:4.100   1st Qu.: 56.77  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.400   Median : 74.29  
##                                                           Mean   :4.267   Mean   : 79.11  
##                                                           3rd Qu.:4.620   3rd Qu.: 98.02  
##                                                           Max.   :4.980   Max.   :207.98  
##  total.reviews     sale.price    
##  Min.   :  0.0   Min.   : 0.740  
##  1st Qu.:105.0   1st Qu.: 7.140  
##  Median :128.0   Median : 8.630  
##  Mean   :132.6   Mean   : 8.641  
##  3rd Qu.:163.0   3rd Qu.:10.160  
##  Max.   :248.0   Max.   :17.460
#From the summary it is understood that sold.by, publisher.type and genre are character variables and others are numeric variables.

#As publisher.type, genre and sold.by are categorical columns, it is better to convert them from character to factor format.
publishing.data$publisher.type <- as.factor(publishing.data$publisher.type)
levels(publishing.data$publisher.type)
## [1] "amazon"        "big five"      "indie"         "single author" "small/medium"
#Using levels it can be found there are 5 factors into which the data in publisher.type column are divided.

publishing.data$sold.by <- as.factor(publishing.data$sold.by)
levels(publishing.data$sold.by)
##  [1] "Amazon Digital Services,  Inc."       "Cengage Learning"                    
##  [3] "DC Comics"                            "Hachette Book Group"                 
##  [5] "HarperCollins Christian Publishing"   "HarperCollins Publishers"            
##  [7] "HarperCollins Publishing"             "Idea & Design Works"                 
##  [9] "Macmillan"                            "Penguin Group (USA) LLC"             
## [11] "Random House LLC"                     "RCS MediaGroup S.p.A."               
## [13] "Simon and Schuster Digital Sales Inc"
#Using levels it can be seen that there are 13 different sellers of books present in the dataset.

publishing.data$genre <- as.factor(publishing.data$genre)
levels(publishing.data$genre)
## [1] "childrens"   "fiction"     "non_fiction"
#Using levels it can be seen that there are books related to 3 different genres being sold in the store.

#Checking for NA values in specific columns
summarise_all(publishing.data, ~ sum(is.na(.x)))
## # A tibble: 1 × 7
##   sold.by publisher.type genre avg.review daily.sales total.reviews sale.price
##     <int>          <int> <int>      <int>       <int>         <int>      <int>
## 1       0              0     0          0           0             0          0
#Since there are no NA values, no further cleaning is needed for this dataset.

Do books from different genres have different daily sales on average?

#Null Hypothesis of the model: Books from different genres have the same daily sales on average
  
#Alternate Hypothesis of the model: Books from different genres have different daily sales on average

#Plotting daily sales by genre to check the distribution of daily sales across genres.
genre.sales <- ggplot(publishing.data, aes(x=daily.sales, fill=genre)) + geom_histogram(position="identity", alpha=0.5, binwidth=10) + labs(x= 'Daily Sales',y='Frequency', title = 'Distribution of number of sales for different genres', caption = "Figure 1. Distribution of number of sales for different genres") + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, face = "bold"),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  ) + scale_fill_discrete(labels=c('Children', 'Fiction', 'Non-Fiction'))

genre.sales

#Visualizing Box Plot for daily sales
ggplot(publishing.data, aes(y = daily.sales)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "Box Plot of daily sales", y = "", caption= "Figure 2. Box Plot of daily sales")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))
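
The outlier.treatment() helper used below is defined elsewhere in the report; a plausible reconstruction, assuming Tukey-fence winsorisation (values beyond 1.5 × IQR of the quartiles are clamped to the fences), is:

```r
# Hypothetical reconstruction of outlier.treatment() (an assumption; the
# original helper is not shown in this excerpt): clamp values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those fences rather than dropping them.
outlier.treatment <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

outlier.treatment(c(1:10, 100))  # 100 is pulled in to the upper fence
```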

#treating outliers:
publishing.data$daily.sales <- outlier.treatment(publishing.data$daily.sales)
#Formulating the regression model to calculate the effect of genre on daily sales
m.sales.by.genre <- lm(daily.sales ~ genre, data = publishing.data)
#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.genre)
## 
## Call:
## lm(formula = daily.sales ~ genre, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.219 -13.623   0.154  13.334  60.433 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       55.8970     0.4398  127.09   <2e-16 ***
## genrefiction      40.7922     0.6220   65.58   <2e-16 ***
## genrenon_fiction  19.9589     0.6220   32.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.67 on 5997 degrees of freedom
## Multiple R-squared:  0.4177, Adjusted R-squared:  0.4175 
## F-statistic:  2151 on 2 and 5997 DF,  p-value: < 2.2e-16
#Calculating the estimated marginal means for the model to get means for each genre of the model.
( m.sales.by.genre.emm <- emmeans(m.sales.by.genre, ~genre) )
##  genre       emmean   SE   df lower.CL upper.CL
##  childrens     55.9 0.44 5997     55.0     56.8
##  fiction       96.7 0.44 5997     95.8     97.6
##  non_fiction   75.9 0.44 5997     75.0     76.7
## 
## Confidence level used: 0.95
#Using pairs and confint within emmeans to find the 95% confidence intervals
( m.sales.by.genre.pairs <- confint(pairs(m.sales.by.genre.emm)) )
##  contrast                estimate    SE   df lower.CL upper.CL
##  childrens - fiction        -40.8 0.622 5997    -42.3    -39.3
##  childrens - non_fiction    -20.0 0.622 5997    -21.4    -18.5
##  fiction - non_fiction       20.8 0.622 5997     19.4     22.3
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
#Conducting Anova to test whether the variable 'genre' has a significant effect on 'daily sales'
anova(m.sales.by.genre)
## Analysis of Variance Table
## 
## Response: daily.sales
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## genre        2 1664260  832130  2150.7 < 2.2e-16 ***
## Residuals 5997 2320322     387                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Presenting plots for estimates for each genre and estimates of the difference between genres

#Plotting estimates for each genre
p.genre <- ggplot(summary(m.sales.by.genre.emm), aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Genre", y="Daily Sales", subtitle="Error Bars are Extent of 95% CIs", caption = "Figure 3(a). Estimated Means of daily sales for \ndifferent genres") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 9, face = "bold", hjust=0.5),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  )

#Plotting the differences in estimates based on genre
p.contrasts <- ggplot(m.sales.by.genre.pairs, aes(x=contrast, y=estimate, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Contrast", y="Difference in Daily Sales", subtitle="Error Bars are Extent of 95% CIs", caption = "Figure 3(b). Estimated differences in daily sales \nmeans between genres") +
scale_x_discrete(guide = guide_axis(n.dodge=2)) + theme_classic() + theme(
    plot.title = element_text(color = "black", size = 16, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 9, face = "bold", hjust = 0.5),
    plot.caption = element_text(face = "italic", hjust = 0.25)
  )

#Using grid.arrange to display both plots simultaneously
grid.arrange(p.genre, p.contrasts, ncol=2, top=textGrob("Differences in daily sales number estimates based on genre", gp=gpar(fontsize=12, fontface= 2)))

Do books have more/fewer sales depending upon their average review scores and total number of reviews?

#Null Hypothesis of the model: Books have the same amount of sales depending on their average review scores and total number of reviews
  
#Alternate Hypothesis of the model: Books have fewer/more sales depending on their average review scores and total number of reviews


#Outlier inspection

#average review scores
ggplot(publishing.data, aes(y = avg.review)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "Box Plot of average reviews", y = "", caption= "Figure 4. Box Plot for average reviews")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))

#outlier treatment
publishing.data$avg.review <- outlier.treatment(publishing.data$avg.review)

#total number of reviews
ggplot(publishing.data, aes(y = total.reviews)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "Box Plot of total number of reviews", y = "", caption= "Figure 5. Box Plot for total number of reviews")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))

publishing.data$total.reviews <- outlier.treatment(publishing.data$total.reviews)
#Formulating a multiple regression model to calculate the effect of average reviews and total number of reviews on daily sales
m.sales.by.review <- lm(daily.sales ~ avg.review + total.reviews, data = publishing.data)

#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.review)
## 
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.860 -14.209  -0.608  13.715  69.030 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   14.953805   3.974847   3.762  0.00017 ***
## avg.review     0.391633   0.875518   0.447  0.65466    
## total.reviews  0.446859   0.007189  62.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.1 on 5997 degrees of freedom
## Multiple R-squared:  0.3919, Adjusted R-squared:  0.3917 
## F-statistic:  1932 on 2 and 5997 DF,  p-value: < 2.2e-16
#Conducting Anova to test whether the variables 'total reviews' and 'average review' have a significant effect on 'daily sales'
anova(m.sales.by.review)
## Analysis of Variance Table
## 
## Response: daily.sales
##                 Df  Sum Sq Mean Sq   F value Pr(>F)    
## avg.review       1     488     488    1.2072 0.2719    
## total.reviews    1 1561020 1561020 3863.4562 <2e-16 ***
## Residuals     5997 2423074     404                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Using the `coef()` and `confint()` functions to get the confidence intervals for the estimation approach.
cbind(coefficient=coef(m.sales.by.review), confint(m.sales.by.review))
##               coefficient      2.5 %     97.5 %
## (Intercept)    14.9538048  7.1616760 22.7459335
## avg.review      0.3916333 -1.3246964  2.1079630
## total.reviews   0.4468588  0.4327654  0.4609523
#To check for interaction effect a regression model with an interaction term can be formulated.

#Formulating a multiple regression model with an interaction term between average reviews and total number of reviews to calculate their effect on daily sales.
m.sales.by.review.interaction <- lm(daily.sales ~ avg.review*total.reviews, data = publishing.data)
#Using summary to find the coefficient values and p-values gained from running the model.
summary(m.sales.by.review.interaction)
## 
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.855 -14.206  -0.576  13.749  69.030 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              22.21342   14.84263   1.497 0.134551    
## avg.review               -1.25373    3.35731  -0.373 0.708840    
## total.reviews             0.39240    0.10751   3.650 0.000264 ***
## avg.review:total.reviews  0.01234    0.02431   0.508 0.611717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.1 on 5996 degrees of freedom
## Multiple R-squared:  0.3919, Adjusted R-squared:  0.3916 
## F-statistic:  1288 on 3 and 5996 DF,  p-value: < 2.2e-16
#Conducting Anova to test whether 'total reviews' and 'average review' have a significant effect on 'daily sales' when an interaction term is included
anova(m.sales.by.review.interaction)
## Analysis of Variance Table
## 
## Response: daily.sales
##                            Df  Sum Sq Mean Sq   F value Pr(>F)    
## avg.review                  1     488     488    1.2070 0.2720    
## total.reviews               1 1561020 1561020 3862.9780 <2e-16 ***
## avg.review:total.reviews    1     104     104    0.2577 0.6117    
## Residuals                5996 2422970     404                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Using the `coef()` and `confint()` functions to get the confidence intervals for the estimation approach.
cbind(coefficient=coef(m.sales.by.review.interaction), confint(m.sales.by.review.interaction))
##                          coefficient      2.5 %      97.5 %
## (Intercept)              22.21342362 -6.8834794 51.31032660
## avg.review               -1.25372619 -7.8352639  5.32781148
## total.reviews             0.39240480  0.1816513  0.60315826
## avg.review:total.reviews  0.01233938 -0.0353108  0.05998957
#Checking for multicollinearity in the model
vif(m.sales.by.review)
##    avg.review total.reviews 
##       1.00011       1.00011

Effect of sale price upon the number of sales for different genres.

#determining outliers

ggplot(publishing.data, aes(y = sale.price)) + 
  geom_boxplot(fill="grey", outlier.fill = "red", outlier.colour = "red")+
  labs(title = "Box Plot of sales price", y = "", caption= "Figure 6. Box Plot for sales price")+
  theme_classic() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        title= element_text(color = "black", size = 8, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0.25))

#treating outliers

publishing.data$sale.price <- outlier.treatment(publishing.data$sale.price)


#Creating a simple linear regression model to calculate the effect of sale price on daily sales
m.sales.by.saleprice <- lm(daily.sales ~ sale.price, data = publishing.data)
summary(m.sales.by.saleprice)
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.495 -17.850  -2.971  15.564  70.537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.3353     1.4780   70.59   <2e-16 ***
## sale.price   -3.2750     0.1676  -19.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.99 on 5998 degrees of freedom
## Multiple R-squared:  0.05987,    Adjusted R-squared:  0.05971 
## F-statistic: 381.9 on 1 and 5998 DF,  p-value: < 2.2e-16
anova(m.sales.by.saleprice)
## Analysis of Variance Table
## 
## Response: daily.sales
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## sale.price    1  238546  238546  381.95 < 2.2e-16 ***
## Residuals  5998 3746036     625                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.saleprice.emm <- emmeans(m.sales.by.saleprice, ~sale.price))
##  sale.price emmean    SE   df lower.CL upper.CL
##        8.61   76.1 0.323 5998     75.5     76.8
## 
## Confidence level used: 0.95
#Creating a multiple regression model to calculate the effect of sale price on daily sales for different genres
m.sales.by.saleprice.genre <- lm(daily.sales ~ sale.price + genre, data = publishing.data)
summary(m.sales.by.saleprice.genre)
## 
## Call:
## lm(formula = daily.sales ~ sale.price + genre, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.222 -13.572   0.048  13.194  59.328 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       60.9984     1.4549  41.927  < 2e-16 ***
## sale.price        -0.5272     0.1433  -3.678 0.000237 ***
## genrefiction      39.9723     0.6601  60.551  < 2e-16 ***
## genrenon_fiction  19.0866     0.6651  28.698  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.65 on 5996 degrees of freedom
## Multiple R-squared:  0.419,  Adjusted R-squared:  0.4187 
## F-statistic:  1441 on 3 and 5996 DF,  p-value: < 2.2e-16
anova(m.sales.by.saleprice.genre)
## Analysis of Variance Table
## 
## Response: daily.sales
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## sale.price    1  238546  238546  617.82 < 2.2e-16 ***
## genre         2 1430938  715469 1853.03 < 2.2e-16 ***
## Residuals  5996 2315098     386                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.saleprice.genre.emm <- emmeans(m.sales.by.saleprice.genre, ~sale.price+genre))
##  sale.price genre       emmean    SE   df lower.CL upper.CL
##        8.61 childrens     56.5 0.465 5996     55.5     57.4
##        8.61 fiction       96.4 0.445 5996     95.6     97.3
##        8.61 non_fiction   75.5 0.447 5996     74.7     76.4
## 
## Confidence level used: 0.95
( m.sales.by.saleprice.genre.pairs <- confint(pairs(m.sales.by.saleprice.genre.emm)) )
##  contrast                                                                      estimate    SE   df
##  sale.price8.60705074730622 childrens - sale.price8.60705074730622 fiction        -40.0 0.660 5996
##  sale.price8.60705074730622 childrens - sale.price8.60705074730622 non_fiction    -19.1 0.665 5996
##  sale.price8.60705074730622 fiction - sale.price8.60705074730622 non_fiction       20.9 0.622 5996
##  lower.CL upper.CL
##     -41.5    -38.4
##     -20.6    -17.5
##      19.4     22.3
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
# Fitting a linear regression model with a sale.price x genre interaction term
m.interaction <- lm(daily.sales ~ sale.price*genre, data = publishing.data)
summary(m.interaction)
## 
## Call:
## lm(formula = daily.sales ~ sale.price * genre, data = publishing.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.089 -13.654   0.154  13.207  57.498 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  69.4396     2.4672  28.146  < 2e-16 ***
## sale.price                   -1.3995     0.2509  -5.578 2.54e-08 ***
## genrefiction                 26.6920     3.2363   8.248  < 2e-16 ***
## genrenon_fiction              8.5527     3.1654   2.702  0.00691 ** 
## sale.price:genrefiction       1.4681     0.3557   4.127 3.72e-05 ***
## sale.price:genrenon_fiction   1.1332     0.3479   3.257  0.00113 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.62 on 5994 degrees of freedom
## Multiple R-squared:  0.4208, Adjusted R-squared:  0.4203 
## F-statistic:   871 on 5 and 5994 DF,  p-value: < 2.2e-16
anova(m.interaction)
## Analysis of Variance Table
## 
## Response: daily.sales
##                    Df  Sum Sq Mean Sq   F value    Pr(>F)    
## sale.price          1  238546  238546  619.5580 < 2.2e-16 ***
## genre               2 1430938  715469 1858.2369 < 2.2e-16 ***
## sale.price:genre    2    7255    3627    9.4212  8.22e-05 ***
## Residuals        5994 2307844     385                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(m.sales.by.saleprice.genre.emm.i <- emmeans(m.interaction, ~sale.price*genre))
##  sale.price genre       emmean    SE   df lower.CL upper.CL
##        8.61 childrens     57.4 0.514 5994     56.4     58.4
##        8.61 fiction       96.7 0.456 5994     95.8     97.6
##        8.61 non_fiction   75.7 0.461 5994     74.8     76.6
## 
## Confidence level used: 0.95
anova(m.sales.by.saleprice, m.sales.by.saleprice.genre)
## Analysis of Variance Table
## 
## Model 1: daily.sales ~ sale.price
## Model 2: daily.sales ~ sale.price + genre
##   Res.Df     RSS Df Sum of Sq    F    Pr(>F)    
## 1   5998 3746036                                
## 2   5996 2315098  2   1430938 1853 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Section 2

This report is designed to meet the requests of the managers of a publishing company, undertaking the specific analyses requested. This report aims to:

  1. Examine whether books from different genres have different daily sales.

  2. Examine whether the sales of the books depend on their average review scores and total number of reviews.

  3. Examine the effect of sales price of the book on the number of sales for the book and if it differs for books belonging to different genres.

Evaluating the Impact of Genre on Daily Sales:

The graph in Figure 1 displays the distribution of daily sales across different genres. As shown, the mean of the distribution varies by genre.

A one-way ANOVA found that the daily sales of books differ significantly by genre, \(F(2, 5997) = 2150.7\), \(p < 0.0001\). Comparing the mean daily sales for the children's genre (55.6) with the other genres, fiction's average daily sales were significantly higher, by 50.30, \(t(5997) = 71.53\), \(p < 0.0001\). Similarly, average daily sales for non-fiction were 20.28 higher than for children's books, \(t(5997) = 28.85\), \(p < 0.001\). Thus, under null hypothesis significance testing, the hypothesis that the true mean daily sales are the same for all genres was rejected.
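The genre-only model behind these statistics is fitted earlier in the report; as a minimal sketch, it can be reproduced on simulated data standing in for `publishing.data` (same column names and genre levels as in the models above — the simulated means and spread are illustrative assumptions, not the real data):

```r
# Sketch: one-way model of daily sales by genre, on simulated data
# standing in for publishing.data (column names as in the report).
set.seed(1)
sim <- data.frame(
  genre = factor(rep(c("childrens", "fiction", "non_fiction"), each = 2000)),
  daily.sales = c(rnorm(2000, 56, 20), rnorm(2000, 96, 20), rnorm(2000, 76, 20))
)
m.sales.by.genre <- lm(daily.sales ~ genre, data = sim)
anova(m.sales.by.genre)   # F-test: do mean daily sales differ by genre?
summary(m.sales.by.genre) # t-tests of each genre against the childrens baseline
```

With `lm()`, the `childrens` level is the reference category, so the `genrefiction` and `genrenon_fiction` coefficients are the pairwise differences reported above.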

Figures 3(a) and 3(b) both show a significant difference in mean daily sales among genres. The 95% confidence intervals for the population means do not overlap, so it is highly unlikely that the true mean daily sales are the same for each genre; the figures also show that the differences between the means are substantial. These findings lead us to reject the null hypothesis that average daily sales are identical across genres, and instead suggest that books from different genres have distinct average daily sales.

A multiple regression examined the influence of the average review score and the total number of reviews on the daily sales of books. Holding the average review score constant, daily sales vary significantly with the total number of reviews, \(t(5997) = 62.16\), \(p < 0.0001\): on average, daily sales increase by 0.44 e-books for each additional review. The estimated 95% CI for this slope does not include 0, indicating that in the population an increase of one review raises daily sales. By contrast, holding the total number of reviews constant, daily sales do not vary significantly with the average review score, \(t(5997) = 0.44\), \(p > 0.05\); we therefore fail to reject the null hypothesis that there is no relationship between average review score and daily sales when controlling for the total number of reviews. Consistent with this, the corresponding estimated 95% CI does include 0, suggesting that for the population a one-unit increase in the average review score is unlikely to change daily sales. In conclusion, this analysis demonstrates that the number of reviews is more important than the average rating when it comes to increasing the daily sales of books.
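The additive model described here can be sketched as follows. Since that code chunk is not reproduced in this section, the column names `total.reviews` and `avg.review` are assumptions, and the data are simulated (with a review-count effect of 0.44 and no review-score effect, mirroring the reported result):

```r
# Sketch of the additive reviews model (column names total.reviews and
# avg.review are assumed; data are simulated, not publishing.data).
set.seed(2)
n <- 1000
sim <- data.frame(total.reviews = rpois(n, 100),
                  avg.review    = runif(n, 1, 5))
sim$daily.sales <- 30 + 0.44 * sim$total.reviews + rnorm(n, 0, 5)
m.reviews <- lm(daily.sales ~ total.reviews + avg.review, data = sim)
summary(m.reviews)   # t-test for each slope, holding the other predictor constant
confint(m.reviews)   # 95% CIs: total.reviews excludes 0 here
```

Each coefficient is the expected change in daily sales for a one-unit increase in that predictor with the other held constant, which is exactly the quantity the t-tests above evaluate.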

An additional analysis investigated whether the influence of the number of reviews on daily sales varies with the average review score. The interaction was not statistically significant, \(t(5996) = 0.508\), \(p > 0.05\), so we fail to reject the null hypothesis that the effect of the number of reviews on sales does not depend on the book's average review score. Furthermore, the estimated 95% CI for the change in this effect per one-unit increase in the average review score includes 0, suggesting that for the population a one-unit increase in the average review score is unlikely to alter the effect of the number of reviews on daily sales.
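This interaction test can be sketched by comparing the additive model against one with a product term, as below. Again the column names are assumptions and the data are simulated (with no true interaction, so the test typically comes out non-significant, as in the report):

```r
# Sketch of the interaction test: does the effect of total.reviews on
# daily sales depend on avg.review? (names assumed; data simulated)
set.seed(3)
n <- 1000
sim <- data.frame(total.reviews = rpois(n, 100),
                  avg.review    = runif(n, 1, 5))
sim$daily.sales <- 30 + 0.44 * sim$total.reviews + rnorm(n, 0, 5)
m.add <- lm(daily.sales ~ total.reviews + avg.review, data = sim)
m.int <- lm(daily.sales ~ total.reviews * avg.review, data = sim)
anova(m.add, m.int)  # F-test on the single added interaction term
```

The `*` in the second formula expands to both main effects plus their interaction, so the model comparison isolates the interaction term alone (1 df).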

Evaluating the Effect of Sale Price on Daily Sales Across Genres:

Initially, a simple regression investigated the relationship between sale price and daily sales. On average, daily sales drop by 3.8 e-books for every $1 rise in the sale price, \(t(5998) = -18.78\), \(p < 0.0001\). The estimated 95% CI for the change in daily sales per $1 increase in sale price does not include 0, further supporting the negative relationship and indicating that the slope of the best-fit regression line is negative for the population as well.

Finally, a more complex regression analysis determined whether the relationship between sale price and daily sales differs among genres of e-books, and found that it does, \(F(2, 5994) = 9.42\), \(p < 0.001\). Specifically, the slope of daily sales on sale price differs by 1.13 between the non-fiction and children's genres, \(t(5994) = 3.257\), \(p < 0.01\), and by 1.47 between the fiction and children's genres, \(t(5994) = 4.127\), \(p < 0.001\). These results lead us to reject the null hypothesis that the relationship between sale price and daily sales is the same for all genres. Furthermore, the estimated 95% CIs for these differences in slope do not include 0, indicating that the relationship between sale price and daily sales is likely to differ among genres in the population as well.